Module II

This document provides an overview of regular expressions (regex) and their key functions, including meta-characters, pattern objects, and methods for matching and searching text. It also covers string modification methods like split, sub, and subn, as well as text processing concepts using spaCy, such as tokenization, counting words, and vocabulary extraction. Finally, it outlines a step-by-step guide for sentiment classification using a review dataset, detailing data preparation, vocabulary building, and encoding.


Here’s a brief note on common regex functions and their key meta-characters:

Meta-Characters in Regex

1. Parentheses (()):
   o Used for grouping expressions or capturing groups.
     Example: (abc) matches "abc".
   o Can extract specific parts of a match.
     Example: (a|b)c matches "ac" or "bc".
2. Caret (^):
   o When at the start of a pattern, matches the beginning of a string.
     Example: ^abc matches "abc" only if it's at the start.
   o Inside square brackets [^...], it negates the character set.
     Example: [^abc] matches any character except "a", "b", or "c".
3. Backslash (\):
   o Used for escaping special characters (e.g., \. to match a literal period).
   o Indicates special sequences such as:
     \d → Digits (0-9).
     \D → Non-digits.
     \w → Word characters (letters, digits, underscore).
     \W → Non-word characters.
     \s → Whitespace characters.
     \S → Non-whitespace characters.
4. Square Brackets ([]):
   o Used to define a character set. Matches any single character inside.
     Example: [abc] matches "a", "b", or "c".
   o Can define ranges.
     Example: [a-z] matches any lowercase letter.
5. Special Sequences:
   o Shortcuts for common patterns using a backslash. Examples include:
     \b → Word boundary.
     \B → Non-word boundary.
     \A → Start of the string.
     \Z → End of the string.
6. Asterisk (*):
   o Matches zero or more occurrences of the preceding character or group.
     Example: ab* matches "a", "ab", "abb", etc.
7. Plus (+):
   o Matches one or more occurrences of the preceding character or group.
     Example: ab+ matches "ab", "abb", etc., but not "a".
8. Question Mark (?):
   o Matches zero or one occurrence of the preceding character or group.
     Example: ab? matches "a" or "ab".
   o Appended to another quantifier, it makes matching lazy (non-greedy).
     Example: a+? matches the shortest possible run of "a".
9. Curly Brackets ({}):
   o Used to define a specific number of occurrences.
     Example: a{2} matches "aa".
   o Ranges can also be defined:
     {m,n} matches between m and n times.
     {m,} matches m or more times.
     {,n} matches up to n times.

Summary Table

Meta-Character   Function                        Example      Matches
()               Grouping                        (ab)+        "ab", "abab"
^                Start of string                 ^abc         "abc" (start only)
\                Escaping/special sequences      \d, \.       Digits, literal "."
[]               Character set                   [a-c]        "a", "b", or "c"
*                Zero or more                    ab*          "a", "ab", "abb"
+                One or more                     ab+          "ab", "abb"
?                Zero or one, or lazy matching   ab? or a+?   "a", "ab" / shortest "a"
{}               Specific or range occurrences   a{2,3}       "aa", "aaa"
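
The following short sketch exercises several of these meta-characters with Python's built-in re module; the sample strings are made up purely for illustration.

import re

print(re.findall(r'\d+', 'Order 12, item 345'))   # ['12', '345']   -> \d with +
print(re.findall(r'[a-c]', 'abcdef'))             # ['a', 'b', 'c'] -> character set
print(bool(re.search(r'^abc$', 'abc')))           # True            -> ^ and $ anchor the whole string
print(re.findall(r'a{2,3}', 'aaaa'))              # ['aaa']         -> greedy range quantifier
print(re.findall(r'ab?', 'a ab'))                 # ['a', 'ab']     -> ? means zero or one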

Pattern Objects in Regex

A pattern object is a compiled representation of a regular expression. It is created with the re.compile() function and provides methods to match or search text efficiently.

Example:

import re
pattern = re.compile(r'\d+')  # Compiles a regex for matching one or more digits

Match vs. Search Methods

1. match() Method:
   o Tries to match the pattern only at the beginning of the string.
   o Returns a match object if successful, else None.
   o Example:
     result = pattern.match("123abc")
     print(result.group())  # Output: "123"

2. search() Method:
   o Scans the entire string and returns the first match found.
   o Returns a match object if successful, else None.
   o Example:
     result = pattern.search("abc123xyz")
     print(result.group())  # Output: "123"

finditer() Method

- Finds all non-overlapping matches of the pattern in the string.
- Returns an iterator of match objects.
- Useful for iterating over multiple matches and extracting details like position.
- Example:
  matches = pattern.finditer("123abc456")
  for match in matches:
      print(match.group(), match.start(), match.end())
  # Output:
  # 123 0 3
  # 456 6 9

Logical OR (|)

- Matches either of the patterns separated by the pipe (|).
- Example:
  pattern = re.compile(r'apple|banana')
  result = pattern.search("I like banana")
  print(result.group())  # Output: "banana"

Beginning and End Patterns

1. Caret (^):
   o Matches the beginning of a string.
     Example: ^abc matches "abc" only if it's at the start.

2. Dollar Sign ($):
   o Matches the end of a string.
     Example: xyz$ matches "xyz" only if it's at the end.

3. Combine both to match an entire string:
   o Example: ^abc$ matches "abc" only if it's the whole string.

Parentheses (())

1. Grouping:
   o Groups part of the regex for operations like quantifiers or logical OR.
   o Example: (abc)+ matches "abc", "abcabc", etc.

2. Capturing:
   o Captures the matched substring for later use.
   o Example:
     pattern = re.compile(r'(\d+)-(\w+)')
     result = pattern.search("123-abc")
     print(result.group(1))  # Output: "123"
     print(result.group(2))  # Output: "abc"

3. Non-Capturing Groups:
   o Use (?:...) to group without capturing.
   o Example: (?:abc)+ groups "abc" but does not create a capture group, as shown below.
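
A small sketch contrasting capturing and non-capturing groups on the same (arbitrary) input:

import re

m1 = re.search(r'(abc)+', 'abcabc')
print(m1.group(), m1.groups())    # abcabc ('abc',)  -> group 1 captures the last repetition

m2 = re.search(r'(?:abc)+', 'abcabc')
print(m2.group(), m2.groups())    # abcabc ()        -> grouping only, nothing is captured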

Quick Comparison Table

Method/Feature     Purpose                           Example                            Result
match()            Match at the beginning            re.match(r'\d+', '123abc')         Match: "123"
search()           Find first match anywhere         re.search(r'\d+', 'abc123xyz')     Match: "123"
finditer()         Find all matches with positions   re.finditer(r'\d+', '123abc456')   "123" at 0-3, "456" at 6-9
Logical OR (|)     Matches one of the alternatives   r'apple|banana'                    "apple" or "banana"
^ and $            Beginning and end of string       ^abc$                              Matches "abc" only
Parentheses (())   Grouping and capturing            (ab)+                              Matches "ab", "abab", etc.

Here's a brief explanation of the string modification functions in Python's re module: split, sub, and subn.

1. split() Method

- Purpose: Splits a string into a list using a regex pattern as the delimiter.
- Syntax: re.split(pattern, string, maxsplit=0)
  o pattern: Regex pattern used for splitting.
  o string: Input string to split.
  o maxsplit: Maximum number of splits (default is 0, which means no limit).
- Example:
  import re
  result = re.split(r'\s+', 'This is a test string')
  print(result)
  # Output: ['This', 'is', 'a', 'test', 'string']
- Advanced Example:
  result = re.split(r'[.,]', 'apple,orange.banana')
  print(result)
  # Output: ['apple', 'orange', 'banana']

2. sub() Method

- Purpose: Replaces all occurrences of a pattern in a string with a specified replacement.
- Syntax: re.sub(pattern, replacement, string, count=0)
  o pattern: Regex pattern to find.
  o replacement: Replacement string.
  o string: Input string.
  o count: Maximum number of replacements (default is 0, which means replace all).
- Example:
  result = re.sub(r'\d+', '#', 'Order123 and Order456')
  print(result)
  # Output: 'Order# and Order#'
- Advanced Example:
  result = re.sub(r'(\d+)', r'[\1]', 'Item 123, Item 456')
  print(result)
  # Output: 'Item [123], Item [456]'

3. subn() Method

- Purpose: Works like sub() but returns a tuple containing:
  1. The modified string.
  2. The number of replacements made.
- Syntax: re.subn(pattern, replacement, string, count=0)
- Example:
  result = re.subn(r'\d+', '#', 'Order123 and Order456')
  print(result)
  # Output: ('Order# and Order#', 2)
- Advanced Example:
  result = re.subn(r'[aeiou]', '*', 'banana')
  print(result)
  # Output: ('b*n*n*', 3)

Key Differences

Method Purpose Output


split() Splits the string into a list using regex List of substrings
sub() Replaces matches with a replacement Modified string
subn() Like sub() but also returns count Tuple: (modified string, count)

Here’s a concise explanation of text processing concepts and how they are handled using
spaCy:

Key Concepts in Text Processing

1. Words:
o Basic units of text.
o Words are typically separated by spaces or punctuation in a sentence.
2. Tokens:
o A token is a segment of text (e.g., words, punctuation, or symbols).
o Tokenization is the process of breaking text into these segments.
3. Counting Words:
o Determining the number of words or tokens in a text.
o Includes removing duplicates (for unique words) or punctuation depending on
requirements.
4. Vocabulary:
o The set of unique words in a text or corpus.
o Often built after preprocessing (e.g., converting text to lowercase, removing
stopwords).
5. Corpus:
o A collection of texts used for analysis.
o Can be a single document or a large dataset (e.g., Wikipedia articles).
6. Tokenization:
o Splitting text into smaller units (tokens), such as words or subwords.
o spaCy provides efficient and accurate tokenization that respects language
rules.

Text Processing with spaCy

1. Tokenization in spaCy

- Basic Tokenization:
  import spacy
  nlp = spacy.load("en_core_web_sm")  # Load spaCy model
  doc = nlp("This is a sample sentence!")
  for token in doc:
      print(token.text)
  # Output:
  # This
  # is
  # a
  # sample
  # sentence
  # !
- Customizing Tokenization:
  o spaCy allows customizing tokenization rules via nlp.tokenizer; one possible special-case rule is sketched below.
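
One possible illustration (not from the original notes) uses the tokenizer's special-case rules; the word "gimme" and its split are arbitrary choices, and the ORTH pieces must concatenate back to the original string.

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
# Register a special-case rule: always split "gimme" into two tokens.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

doc = nlp("gimme that sample")
print([token.text for token in doc])   # ['gim', 'me', 'that', 'sample']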

2. Counting Words

- Total tokens in a document:
  print(len(doc))  # Total number of tokens
- Count specific word occurrences:
  from collections import Counter
  word_counts = Counter([token.text for token in doc])
  print(word_counts)
  # Example Output: Counter({'This': 1, 'is': 1, 'a': 1, 'sample': 1, 'sentence': 1, '!': 1})

3. Vocabulary Extraction

- Extracting unique words:
  vocabulary = set([token.text.lower() for token in doc if token.is_alpha])
  print(vocabulary)
  # Example Output: {'this', 'is', 'a', 'sample', 'sentence'}

4. Corpus Handling

- Apply spaCy to a corpus (multiple documents):
  texts = ["This is the first document.", "This is the second."]
  docs = [nlp(text) for text in texts]
  for doc in docs:
      print([token.text for token in doc])

5. Advanced Token Attributes

- spaCy tokens provide rich linguistic information:
  for token in doc:
      print(f"Text: {token.text}, Lemma: {token.lemma_}, POS: {token.pos_}")
  # Output:
  # Text: This, Lemma: this, POS: DET
  # Text: is, Lemma: be, POS: AUX
  # Text: a, Lemma: a, POS: DET
  # Text: sample, Lemma: sample, POS: NOUN
  # Text: sentence, Lemma: sentence, POS: NOUN
  # Text: !, Lemma: !, POS: PUNCT

Workflow Example

1. Preprocessing:
o Lowercasing, punctuation removal, stopword removal.
2. Tokenization:
o Break text into tokens (words, punctuation, etc.).
3. Vocabulary Building:
o Collect unique tokens.
4. Word Counting:
o Count occurrences of words/tokens.
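
A minimal sketch of this workflow with spaCy (assuming the en_core_web_sm model is installed; the sentence is invented):

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sample sentence. This sentence is short!")

# Preprocessing + tokenization: keep lowercased alphabetic tokens, drop stopwords
tokens = [t.text.lower() for t in doc if t.is_alpha and not t.is_stop]

vocabulary = set(tokens)        # vocabulary building
word_counts = Counter(tokens)   # word counting

print(tokens)
print(vocabulary)
print(word_counts)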

Summary Table

Concept          Description                            Example with spaCy
Words            Basic text units                       "word1 word2" → ['word1', 'word2']
Tokens           Segments of text                       "word1, word2!" → ['word1', ',', 'word2', '!']
Counting Words   Counting occurrences of tokens/words   Counter({'word1': 1, 'word2': 1})
Vocabulary       Unique set of words                    {'word1', 'word2'}
Corpus           Collection of texts                    ["doc1 text", "doc2 text"] → tokenize each separately
Tokenization     Breaking text into tokens              doc = nlp("This is a test.") → ['This', 'is', 'a', 'test', '.']

Here's a step-by-step guide for performing sentiment classification using a review dataset
(e.g., Yelp), including data preparation, vocabulary building, encoding, and evaluation.

Steps for Sentiment Classification


1. Download and Load Dataset
- Use a Yelp review dataset or any text review dataset. Assume it contains two columns: text (reviews) and sentiment (labels: positive/negative).
import pandas as pd

# Load dataset
url = "https://example.com/yelp_reviews.csv" # Replace with your dataset URL or path
df = pd.read_csv(url)

# Inspect data
print(df.head())
# Output:
# text sentiment
# 0 "Great food and service!" positive
# 1 "Terrible experience!" negative

2. Data Preparation
1. Cleaning and Tokenization:
   o Lowercase the text, remove special characters, tokenize, and drop stopwords.
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Lowercase and remove special characters
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Tokenize and remove stopwords
    tokens = [word for word in word_tokenize(text) if word not in stop_words]
    return tokens

df['tokens'] = df['text'].apply(preprocess)
print(df.head())
2. Build Vocabulary:
   o Use tokens from the entire dataset to construct a vocabulary.
from collections import Counter

# Flatten tokens and count word frequencies
all_tokens = [token for tokens in df['tokens'] for token in tokens]
vocab = Counter(all_tokens)

# Build a vocabulary dictionary (word -> integer index; index 0 is left free for padding)
vocab_dict = {word: idx + 1 for idx, (word, _) in enumerate(vocab.items())}
print(f"Vocabulary size: {len(vocab_dict)}")

3. Encode Data
1. Integer (Index) Encoding:
   o Convert each token to its integer index in the vocabulary.
def one_hot_encode(tokens, vocab_dict):
    # Despite the name, this returns integer indices; a one-hot matrix or an
    # embedding layer can be built from these indices later.
    return [vocab_dict[word] for word in tokens if word in vocab_dict]

df['encoded'] = df['tokens'].apply(lambda x: one_hot_encode(x, vocab_dict))
print(df.head())
2. Padding Sequences:
o Ensure all sequences have the same length.
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 50 # Define max length for reviews


df['padded'] = pad_sequences(df['encoded'], maxlen=max_length, padding='post').tolist()
4. Train-Test Split
from sklearn.model_selection import train_test_split
import numpy as np

# Convert labels to numeric


df['label'] = df['sentiment'].map({'positive': 1, 'negative': 0})

# Split into features (X) and target (y)


X = np.array(df['padded'].tolist())
y = np.array(df['label'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

5. Feature Computation and Model Training


1. Train a Model:
o Use a simple logistic regression or an ML model.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Flatten features for models that expect 2D input


X_train_flat = X_train.reshape(X_train.shape[0], -1)
X_test_flat = X_test.reshape(X_test.shape[0], -1)

# Train a logistic regression model


model = LogisticRegression(max_iter=1000)
model.fit(X_train_flat, y_train)

# Predict on test data


y_pred = model.predict(X_test_flat)
6. Evaluation
1. Confusion Matrix:
o Analyze model performance.
from sklearn.metrics import ConfusionMatrixDisplay

# Compute confusion matrix


conf_matrix = confusion_matrix(y_test, y_pred)

# Display confusion matrix


ConfusionMatrixDisplay(conf_matrix, display_labels=['Negative', 'Positive']).plot()
2. Performance Metrics:
o Classification report provides precision, recall, F1-score.
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

Key Insights from Analysis


1. Confusion Matrix:
o Diagonal values show correct predictions.
o Off-diagonal values indicate misclassifications.
2. Precision, Recall, F1-Score:
o Assess how well the model handles imbalanced data.
3. Vocabulary and Feature Engineering:
o Larger vocabularies can increase accuracy but might overfit. Experiment with
stopword removal or stemming.

Language-Independent Tokenization
Tokenization is a crucial step in natural language processing (NLP) to break down text into
smaller units (tokens). The process is tailored to different use cases and text structures, often
depending on language-specific or task-specific requirements.

1. Types of Tokenization
a. Word Tokenization
- Splits text into words or word-like units.
- Commonly used for tasks like machine translation, sentiment analysis, and text classification.
Example:
Input: "Tokenization is crucial!"
Output: ["Tokenization", "is", "crucial", "!"]
Problems:
  o Language dependence: Rules for word boundaries differ across languages.
    Example: In Chinese or Japanese, words are not space-separated.
  o Out-of-vocabulary (OOV) words: Fails for unknown or rare words.
    Example: "unfathomably" won't match a pre-trained vocabulary.
  o Ambiguity: Hyphenated words or contractions may be split incorrectly (see the short sketch below).
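
A quick sketch of the ambiguity problem: a plain whitespace split leaves punctuation attached, while even a simple regex tokenizer (one possible workaround, not a complete solution) separates it.

import re

text = "Tokenization is crucial!"
print(text.split())                       # ['Tokenization', 'is', 'crucial!']  <- '!' stays attached
print(re.findall(r"\w+|[^\w\s]", text))   # ['Tokenization', 'is', 'crucial', '!']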

b. Character Tokenization
- Splits text into individual characters.
- Language-agnostic and works well for morphologically rich or space-free languages.
Example:
Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
Drawbacks (compare the sequence lengths in the sketch below):
  o Longer sequences: Increases sequence length, making models computationally expensive.
  o Loss of semantic structure: Characters alone don't capture meaningful information.
    Example: The meaning of "word" is lost when broken into ["w", "o", "r", "d"].
  o Difficulty in learning dependencies: Requires more training data to learn meaningful patterns.
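
A tiny sketch of the sequence-length drawback (the sentence is illustrative only):

text = "Tokenization is crucial!"
char_tokens = list(text)                  # character tokenization
print(char_tokens[:6])                    # ['T', 'o', 'k', 'e', 'n', 'i']
print(len(text.split()), "word tokens vs", len(char_tokens), "character tokens")
# 3 word tokens vs 24 character tokens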

c. Sub-Word Tokenization
- Breaks words into smaller units (sub-words) based on frequency patterns.
- Strikes a balance between word and character tokenization.
- Common methods include Byte Pair Encoding (BPE) and WordPiece.
Example (BPE):
Input: "unbelievable"
Output: ["un", "believ", "able"]
Problems:
  o Complexity: Requires pre-processing to build sub-word vocabularies.
  o Language nuances: Morphologically rich languages may still require additional handling.

2. Byte Pair Encoding (BPE)


What is BPE?
- A sub-word tokenization algorithm that iteratively merges the most frequent pairs of characters or sub-words until a target vocabulary size is reached.
- Frequently used in NLP models like GPT and BERT.
Steps (a minimal code sketch appears below):
1. Start with character-level tokens.
   Example: "unbelievable" → ["u", "n", "b", "e", "l", "i", "e", "v", "a", "b", "l", "e"]
2. Identify the most frequent adjacent pair and merge it.
   Example: ("b", "e") → "be" → ["u", "n", "be", "l", "i", "e", "v", "a", "b", "l", "e"]
3. Repeat until the target vocabulary size is reached.
Advantages:
  o Reduces OOV problems by handling rare words as sub-words.
  o Efficient storage and processing compared to word-level tokenization.
  o Language-independent.
Challenges:
  o Training is computationally expensive.
  o May split some meaningful words into less interpretable sub-words.
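
A minimal, illustrative BPE training loop is sketched below. The toy corpus and the number of merges are made up, and real tokenizers (e.g., those used by GPT or BERT) add details such as end-of-word markers.

from collections import Counter

def get_pair_counts(corpus):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word starting as a tuple of characters.
corpus = {tuple("unbelievable"): 3, tuple("believe"): 2, tuple("able"): 2}

num_merges = 8   # stand-in for the target vocabulary size
for _ in range(num_merges):
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]   # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print("merged:", best)

print(list(corpus.keys()))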

Comparison of Tokenization Methods


Aspect                Word Tokenization   Character Tokenization   Sub-Word Tokenization (BPE)
Granularity           Words               Characters               Sub-words
OOV Handling          Poor                Excellent                Good
Sequence Length       Moderate            Long                     Short to Moderate
Semantic Retention    Good                Poor                     Moderate
Language Dependence   High                Low                      Moderate
Computational Cost    Low                 High                     Moderate

Conclusion
- Word Tokenization: Works well for languages with space-separated words but struggles with OOV words and morphological variation.
- Character Tokenization: Best for language-agnostic tasks but computationally expensive and loses semantic structure.
- Sub-Word Tokenization (BPE): A practical middle ground widely used in modern NLP.
String Matching and Spelling Correction: Minimum Edit Distance
Minimum Edit Distance is the minimum number of operations required to convert one
string into another. Common operations include:
1. Insertion: Add a character.
2. Deletion: Remove a character.
3. Substitution: Replace one character with another.
This is widely used in:
- Spelling correction
- DNA sequence analysis
- Natural language processing tasks

Dynamic Programming Approach


Dynamic programming is used to calculate the minimum edit distance efficiently by filling
a table (dp) to compute the result iteratively.
Steps to Calculate Minimum Edit Distance
1. Define the Problem
Let the two strings be:
- s1 (source string) of length m
- s2 (target string) of length n
We need to compute the cost of converting s1 into s2.

2. Define the Table

- Create a 2D table dp[m+1][n+1].
- dp[i][j] represents the minimum edit distance to convert the first i characters of s1 into the first j characters of s2.

3. Initialization
1. If one string is empty, the cost is the length of the other string.
o dp[i][0] = i (Cost of deleting all characters from s1)
o dp[0][j] = j (Cost of inserting all characters into s1)

4. Recurrence Relation
For every character in s1 and s2, compare:
1. If characters are the same:
o No cost: dp[i][j] = dp[i-1][j-1]
2. If characters are different:
o Take the minimum of:
  Insertion: dp[i][j-1] + 1
  Deletion: dp[i-1][j] + 1
  Substitution: dp[i-1][j-1] + 1
Formula:
dp[i][j] = min(
    dp[i-1][j] + 1,        # Deletion
    dp[i][j-1] + 1,        # Insertion
    dp[i-1][j-1] + cost    # Substitution (cost = 0 if s1[i-1] == s2[j-1], else 1)
)

5. Algorithm
def minimum_edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # Initialize DP table
    dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]

    # Fill base cases
    for i in range(m + 1):
        dp[i][0] = i  # Cost of deleting all characters
    for j in range(n + 1):
        dp[0][j] = j  # Cost of inserting all characters

    # Fill the table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:  # Characters match
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + 1,      # Deletion
                    dp[i][j - 1] + 1,      # Insertion
                    dp[i - 1][j - 1] + 1   # Substitution
                )

    # Minimum edit distance is in the bottom-right corner
    return dp[m][n]

# Example Usage
s1 = "kitten"
s2 = "sitting"
print("Minimum Edit Distance:", minimum_edit_distance(s1, s2))
# Output: Minimum Edit Distance: 3

6. Table Filling Example

For s1 = "kitten" and s2 = "sitting":

        ""  s  i  t  t  i  n  g
   ""    0  1  2  3  4  5  6  7
   k     1  1  2  3  4  5  6  7
   i     2  2  1  2  3  4  5  6
   t     3  3  2  1  2  3  4  5
   t     4  4  3  2  1  2  3  4
   e     5  5  4  3  2  2  3  4
   n     6  6  5  4  3  3  2  3
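
The table can be reproduced programmatically with a small variant of the function above that returns the whole matrix instead of only the bottom-right value:

def edit_distance_table(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp

for row_label, row in zip(" kitten", edit_distance_table("kitten", "sitting")):
    print(row_label, row)   # first label is blank for the empty-prefix row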

7. Analysis
1. Time Complexity:
   o O(m·n), where m and n are the lengths of the two strings.
2. Space Complexity:
   o O(m·n) for the table.
3. Applications:
o Spell checkers (suggesting closest corrections).
o DNA sequence alignment (biological string matching).
o Plagiarism detection.
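
As a small illustration of the spell-checking use case (building on minimum_edit_distance above; the candidate word list is invented):

def suggest(word, dictionary):
    # Return the dictionary word with the smallest edit distance to the input.
    return min(dictionary, key=lambda candidate: minimum_edit_distance(word, candidate))

dictionary = ["kitten", "sitting", "mitten", "bitten", "knitting"]
print(suggest("kitte", dictionary))   # "kitten" (edit distance 1)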
