Here’s a brief note on common regex functions and their key meta-characters:
Meta-Characters in Regex
1. Parentheses (()):
o Used for grouping expressions or capturing groups.
Example: (abc) matches "abc".
o Can extract specific parts of a match.
Example: (a|b)c matches "ac" or "bc".
2. Caret (^):
o When at the start, matches the beginning of a string.
Example: ^abc matches "abc" only if it’s at the start.
o Inside square brackets [^...], it negates the character set.
Example: [^abc] matches any character except "a", "b", or "c".
3. Backslash (\):
o Used for escaping special characters (e.g., \. to match a literal period).
o Indicates special sequences like:
\d → Digits (0-9).
\D → Non-digits.
\w → Word characters (letters, digits, underscore).
\W → Non-word characters.
\s → Whitespace characters.
\S → Non-whitespace characters.
4. Square Brackets ([]):
o Used to define a character set. Matches any single character inside.
Example: [abc] matches "a", "b", or "c".
o Can define ranges.
Example: [a-z] matches any lowercase letter.
5. Special Sequences:
o Shortcuts for common patterns using a backslash. Examples include:
\b → Word boundary.
\B → Non-word boundary.
\A → Start of the string.
\Z → End of the string.
6. Asterisk (*):
o Matches zero or more occurrences of the preceding character or group.
Example: ab* matches "a", "ab", "abb", etc.
7. Plus (+):
o Matches one or more occurrences of the preceding character or group.
Example: ab+ matches "ab", "abb", etc., but not "a".
8. Question Mark (?):
o Matches zero or one occurrence of the preceding character or group.
Example: ab? matches "a" or "ab".
o Appended to another quantifier, makes it lazy (non-greedy).
Example: a+? matches as few characters as possible (a single "a").
9. Curly Brackets ({}):
o Used to define a specific number of occurrences.
Example: a{2} matches "aa".
o Ranges can also be defined:
{m,n} matches between m and n times.
{m,} matches m or more times.
{,n} matches up to n times.
Summary Table
Meta-Character | Function | Example | Matches
() | Grouping | (ab)+ | "ab", "abab"
^ | Start of string | ^abc | "abc" (start only)
\ | Escaping / special sequences | \d, \. | Digits, literal "."
[] | Character set | [a-c] | "a", "b", or "c"
* | Zero or more | ab* | "a", "ab", "abb"
+ | One or more | ab+ | "ab", "abb"
? | Zero or one, or lazy matching | ab? or a+? | "a", "ab" / shortest match
{} | Specific number or range of occurrences | a{2,3} | "aa", "aaa"
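To see several of these meta-characters in action, here is a short Python sketch using the standard re module (the patterns and sample strings are purely illustrative):
import re
# Character set with a range and the + quantifier
print(re.findall(r'[a-c]+', 'abc cab xyz'))               # ['abc', 'cab']
# \d with a {m,n} repetition count
print(re.findall(r'\d{2,4}', 'id 7, code 1234, pin 56'))  # ['1234', '56']
# ^ anchor and an escaped literal dot
print(bool(re.match(r'^www\.', 'www.example.com')))       # True
# Word boundary \b and a negated set (words not starting with a vowel)
print(re.findall(r'\b[^aeiou\s]\w*', 'apple banana cherry'))  # ['banana', 'cherry']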
Pattern Objects in Regex
A pattern object is a compiled representation of a regular expression. It is created using the
re.compile() function and provides methods to match or search for text efficiently.
Example:
import re
pattern = re.compile(r'\d+')  # Compiles a regex for matching one or more digits
Match vs. Search Methods
1. match() Method:
o Tries to match the pattern only at the beginning of the string.
o Returns a match object if successful, else None.
o Example:
result = pattern.match("123abc")
print(result.group())  # Output: "123"
2. search() Method:
o Scans the entire string and returns the first match found.
o Returns a match object if successful, else None.
o Example:
result = pattern.search("abc123xyz")
print(result.group())  # Output: "123"
finditer() Method
Finds all non-overlapping matches of the pattern in the string.
Returns an iterator of match objects.
Useful for iterating over multiple matches and extracting details like position.
Example:
matches = pattern.finditer("123abc456")
for match in matches:
    print(match.group(), match.start(), match.end())
# Output:
# 123 0 3
# 456 6 9
Logical OR (|)
Matches either of the patterns separated by the pipe (|).
Example:
pattern = re.compile(r'apple|banana')
result = pattern.search("I like banana")
print(result.group()) # Output: "banana"
Beginning and End Patterns
1. Caret (^):
o Matches the beginning of a string.
Example: ^abc matches "abc" only if it’s at the start.
2. Dollar Sign ($):
o Matches the end of a string.
Example: xyz$ matches "xyz" only if it’s at the end.
3. Combine both to match entire string:
o Example: ^abc$ matches "abc" if it’s the whole string.
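A quick Python illustration of these anchors (the sample strings are arbitrary):
import re
print(bool(re.search(r'^abc', 'abcdef')))   # True  - "abc" at the start
print(bool(re.search(r'xyz$', 'uvwxyz')))   # True  - "xyz" at the end
print(bool(re.search(r'^abc$', 'abc')))     # True  - the whole string is "abc"
print(bool(re.search(r'^abc$', 'abcdef')))  # False - extra text after "abc"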
Parentheses (())
1. Grouping:
o Groups part of the regex for operations like quantifiers or logical OR.
o Example: (abc)+ matches "abc", "abcabc", etc.
2. Capturing:
o Captures the matched substring for later use.
o Example:
pattern = re.compile(r'(\d+)-(\w+)')
result = pattern.search("123-abc")
print(result.group(1))  # Output: "123"
print(result.group(2))  # Output: "abc"
3. Non-Capturing Groups:
o Use (?:...) to group without capturing.
o Example: (?:abc)+ groups "abc" but does not create a capture group.
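A small sketch contrasting capturing and non-capturing groups (the pattern and string are illustrative):
import re
text = "abcabc-123"
# Capturing group: group 1 holds the last repetition of "abc"
m = re.search(r'(abc)+-(\d+)', text)
print(m.groups())   # ('abc', '123')
# Non-capturing group: only the digits are captured
m = re.search(r'(?:abc)+-(\d+)', text)
print(m.groups())   # ('123',)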
Quick Comparison Table
Method/Feature | Purpose | Example | Result
match() | Match only at the beginning | re.match(r'\d+', '123abc') | Match: "123"
search() | Find first match anywhere | re.search(r'\d+', 'abc123xyz') | Match: "123"
finditer() | Find all matches with positions | re.finditer(r'\d+', '123abc456') | "123" at 0–3, "456" at 6–9
Logical OR (|) | Match one of the alternatives | r'apple|banana' | Match: "apple" or "banana"
^ and $ | Beginning (^) and end ($) of string | ^abc$ | Matches "abc" only
Parentheses (()) | Grouping and capturing | (ab)+ | Matches "ab", "abab", etc.
Here's a brief explanation of string modification methods in regex: split, sub, and subn.
1. split() Method
Purpose: Splits a string into a list using a regex pattern as the delimiter.
Syntax: re.split(pattern, string, maxsplit=0)
o pattern: Regex pattern used for splitting.
o string: Input string to split.
o maxsplit: Maximum number of splits (default is 0, which means no limit).
Example:
import re
result = re.split(r'\s+', 'This is a test string')
print(result)
# Output: ['This', 'is', 'a', 'test', 'string']
Advanced Example:
result = re.split(r'[.,]', 'apple,orange.banana')
print(result)
# Output: ['apple', 'orange', 'banana']
2. sub() Method
Purpose: Replaces all occurrences of a pattern in a string with a specified
replacement.
Syntax: re.sub(pattern, replacement, string, count=0)
o pattern: Regex pattern to find.
o replacement: Replacement string.
o string: Input string.
o count: Maximum number of replacements (default is 0, which means replace
all).
Example:
result = re.sub(r'\d+', '#', 'Order123 and Order456')
print(result)
# Output: 'Order# and Order#'
Advanced Example:
result = re.sub(r'(\d+)', r'[\1]', 'Item 123, Item 456')
print(result)
# Output: 'Item [123], Item [456]'
3. subn() Method
Purpose: Works like sub() but returns a tuple containing:
1. The modified string.
2. The number of replacements made.
Syntax: re.subn(pattern, replacement, string, count=0)
Example:
result = re.subn(r'\d+', '#', 'Order123 and Order456')
print(result)
# Output: ('Order# and Order#', 2)
Advanced Example:
result = re.subn(r'[aeiou]', '*', 'banana')
print(result)
# Output: ('b*n*n*', 3)
Key Differences
Method | Purpose | Output
split() | Splits the string into a list using regex | List of substrings
sub() | Replaces matches with a replacement | Modified string
subn() | Like sub() but also returns count | Tuple: (modified string, count)
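The maxsplit and count arguments mentioned above limit how many splits or replacements are performed; a brief sketch (the strings are illustrative):
import re
# Limit split() to a single split from the left
print(re.split(r'\s+', 'This is a test', maxsplit=1))
# Output: ['This', 'is a test']
# Replace only the first match with sub()
print(re.sub(r'\d+', '#', 'Order123 and Order456', count=1))
# Output: 'Order# and Order456'
# subn() also reports how many replacements were actually made
print(re.subn(r'\d+', '#', 'Order123 and Order456', count=1))
# Output: ('Order# and Order456', 1)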
Here’s a concise explanation of text processing concepts and how they are handled using
spaCy:
Key Concepts in Text Processing
1. Words:
o Basic units of text.
o Words are typically separated by spaces or punctuation in a sentence.
2. Tokens:
o A token is a segment of text (e.g., words, punctuation, or symbols).
o Tokenization is the process of breaking text into these segments.
3. Counting Words:
o Determining the number of words or tokens in a text.
o Includes removing duplicates (for unique words) or punctuation depending on
requirements.
4. Vocabulary:
o The set of unique words in a text or corpus.
o Often built after preprocessing (e.g., converting text to lowercase, removing
stopwords).
5. Corpus:
o A collection of texts used for analysis.
o Can be a single document or a large dataset (e.g., Wikipedia articles).
6. Tokenization:
o Splitting text into smaller units (tokens), such as words or subwords.
o spaCy provides efficient and accurate tokenization that respects language
rules.
Text Processing with spaCy
1. Tokenization in spaCy
Basic Tokenization:
import spacy
nlp = spacy.load("en_core_web_sm") # Load spaCy model
doc = nlp("This is a sample sentence!")
for token in doc:
    print(token.text)
# Output:
# This
# is
# a
# sample
# sentence
# !
Customizing Tokenization:
o spaCy allows customizing tokenization rules via nlp.tokenizer, for example by adding special-case rules as sketched below.
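For instance, a special-case rule makes the tokenizer always split a given string the same way (a minimal sketch; the example word is arbitrary):
import spacy
from spacy.symbols import ORTH
nlp = spacy.load("en_core_web_sm")
# Always split "gimme" into "gim" + "me"
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
doc = nlp("gimme that book")
print([token.text for token in doc])
# Expected: ['gim', 'me', 'that', 'book']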
2. Counting Words
Total tokens in a document:
print(len(doc)) # Total number of tokens
Count specific word occurrences:
from collections import Counter
word_counts = Counter([token.text for token in doc])
print(word_counts)
# Example Output: Counter({'This': 1, 'is': 1, 'a': 1, 'sample': 1, 'sentence': 1, '!': 1})
3. Vocabulary Extraction
Extracting unique words:
vocabulary = set([token.text.lower() for token in doc if token.is_alpha])
print(vocabulary)
# Example Output: {'this', 'is', 'a', 'sample', 'sentence'}
4. Corpus Handling
Apply spaCy to a corpus (multiple documents):
texts = ["This is the first document.", "This is the second."]
docs = [nlp(text) for text in texts]
for doc in docs:
    print([token.text for token in doc])
5. Advanced Token Attributes
spaCy tokens provide rich linguistic information:
for token in doc:
    print(f"Text: {token.text}, Lemma: {token.lemma_}, POS: {token.pos_}")
# Output:
# Text: This, Lemma: this, POS: DET
# Text: is, Lemma: be, POS: AUX
# Text: a, Lemma: a, POS: DET
# Text: sample, Lemma: sample, POS: NOUN
# Text: sentence, Lemma: sentence, POS: NOUN
# Text: !, Lemma: !, POS: PUNCT
Workflow Example
1. Preprocessing:
o Lowercasing, punctuation removal, stopword removal.
2. Tokenization:
o Break text into tokens (words, punctuation, etc.).
3. Vocabulary Building:
o Collect unique tokens.
4. Word Counting:
o Count occurrences of words/tokens.
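A compact sketch that strings these four steps together with spaCy (the sample text is arbitrary, and the exact stopword list depends on the model):
import spacy
from collections import Counter
nlp = spacy.load("en_core_web_sm")
text = "The cat sat on the mat. The cat slept."
doc = nlp(text.lower())                                          # 1. Preprocessing (lowercasing) + 2. Tokenization
tokens = [t.text for t in doc if t.is_alpha and not t.is_stop]   # drop punctuation and stopwords
vocabulary = set(tokens)                                         # 3. Vocabulary building
word_counts = Counter(tokens)                                    # 4. Word counting
print(vocabulary)    # e.g. {'cat', 'sat', 'mat', 'slept'}
print(word_counts)   # e.g. Counter({'cat': 2, 'sat': 1, 'mat': 1, 'slept': 1})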
Summary Table
Concept | Description | Example with spaCy
Words | Basic text units | "word1 word2" → ['word1', 'word2']
Tokens | Segments of text | "word1, word2!" → ['word1', ',', 'word2', '!']
Counting Words | Counting occurrences of tokens/words | Counter({'word1': 1, 'word2': 1})
Vocabulary | Unique set of words | {'word1', 'word2'}
Corpus | Collection of texts | ["doc1 text", "doc2 text"] → tokenize each separately
Tokenization | Breaking text into tokens | doc = nlp("This is a test.") → ['This', 'is', 'a', 'test', '.']
Here's a step-by-step guide for performing sentiment classification using a review dataset
(e.g., Yelp), including data preparation, vocabulary building, encoding, and evaluation.
Steps for Sentiment Classification
1. Download and Load Dataset
Use a Yelp review dataset or any text review dataset. Assume it contains two
columns: text (reviews) and sentiment (labels: positive/negative).
import pandas as pd
# Load dataset
url = "https://example.com/yelp_reviews.csv"  # Replace with your dataset URL or path
df = pd.read_csv(url)
# Inspect data
print(df.head())
# Output:
# text sentiment
# 0 "Great food and service!" positive
# 1 "Terrible experience!" negative
2. Data Preparation
1. Cleaning and Tokenization:
o Remove special characters, lowercasing, and tokenization.
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')       # required for word_tokenize
nltk.download('stopwords')   # required for the stopword list
stop_words = set(stopwords.words('english'))
def preprocess(text):
    # Lowercase and remove special characters
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Tokenize and remove stopwords
    tokens = [word for word in word_tokenize(text) if word not in stop_words]
    return tokens
df['tokens'] = df['text'].apply(preprocess)
print(df.head())
2. Build Vocabulary:
o Use tokens from the entire dataset to construct a vocabulary.
from collections import Counter
# Flatten tokens and count word frequencies
all_tokens = [token for tokens in df['tokens'] for token in tokens]
vocab = Counter(all_tokens)
# Build a vocabulary dictionary
vocab_dict = {word: idx + 1 for idx, (word, _) in enumerate(vocab.items())}
print(f"Vocabulary size: {len(vocab_dict)}")
3. Encode Data
1. Integer (Index) Encoding:
o Convert each token to its integer index in the vocabulary (these index sequences can later feed a one-hot or embedding layer).
def encode_tokens(tokens, vocab_dict):
    return [vocab_dict[word] for word in tokens if word in vocab_dict]
df['encoded'] = df['tokens'].apply(lambda x: encode_tokens(x, vocab_dict))
print(df.head())
2. Padding Sequences:
o Ensure all sequences have the same length.
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_length = 50 # Define max length for reviews
df['padded'] = pad_sequences(df['encoded'], maxlen=max_length, padding='post').tolist()
4. Train-Test Split
from sklearn.model_selection import train_test_split
import numpy as np
# Convert labels to numeric
df['label'] = df['sentiment'].map({'positive': 1, 'negative': 0})
# Split into features (X) and target (y)
X = np.array(df['padded'].tolist())
y = np.array(df['label'])
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
5. Feature Computation and Model Training
1. Train a Model:
o Use a simple logistic regression or an ML model.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
# Flatten features for models that expect 2D input
X_train_flat = X_train.reshape(X_train.shape[0], -1)
X_test_flat = X_test.reshape(X_test.shape[0], -1)
# Train a logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_flat, y_train)
# Predict on test data
y_pred = model.predict(X_test_flat)
6. Evaluation
1. Confusion Matrix:
o Analyze model performance.
from sklearn.metrics import ConfusionMatrixDisplay
# Compute confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Display confusion matrix
ConfusionMatrixDisplay(conf_matrix, display_labels=['Negative', 'Positive']).plot()
2. Performance Metrics:
o Classification report provides precision, recall, F1-score.
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
Key Insights from Analysis
1. Confusion Matrix:
o Diagonal values show correct predictions.
o Off-diagonal values indicate misclassifications.
2. Precision, Recall, F1-Score:
o Assess how well the model handles imbalanced data.
3. Vocabulary and Feature Engineering:
o Larger vocabularies can increase accuracy but might overfit. Experiment with
stopword removal or stemming.
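As one such experiment, stemming with NLTK's PorterStemmer collapses inflected forms into a shared stem, which shrinks the vocabulary (a small sketch; the words are illustrative and the outputs approximate):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runs", "service", "services", "easily"]
print([stemmer.stem(w) for w in words])
# e.g. ['run', 'run', 'servic', 'servic', 'easili']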
Language-Independent Tokenization
Tokenization is a crucial step in natural language processing (NLP) to break down text into
smaller units (tokens). The process is tailored to different use cases and text structures, often
depending on language-specific or task-specific requirements.
1. Types of Tokenization
a. Word Tokenization
Splits text into words or word-like units.
Commonly used for tasks like machine translation, sentiment analysis, and text
classification.
Example:
Input: "Tokenization is crucial!"
Output: ["Tokenization", "is", "crucial", "!"]
Problems:
o Language dependence: Rules for word boundaries differ across languages.
Example: In Chinese or Japanese, words are not space-separated.
o Out-of-vocabulary (OOV) words: Fails for unknown or rare words.
Example: "unfathomably" won't match a pre-trained vocabulary.
o Ambiguity: Hyphenated words or contractions may be split incorrectly.
b. Character Tokenization
Splits text into individual characters.
Language-agnostic and works well for morphologically rich or space-free languages.
Example:
Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
Drawbacks:
o Longer sequences: Increases sequence length, making models
computationally expensive.
o Loss of semantic structure: Characters alone don’t capture meaningful
information.
Example: The meaning of "word" is lost when broken into ["w", "o",
"r", "d"].
o Difficulty in learning dependencies: Requires more training data for
meaningful patterns.
c. Sub-Word Tokenization
Breaks words into smaller units (sub-words) based on frequency patterns.
Strikes a balance between word and character tokenization.
Common methods include Byte Pair Encoding (BPE) and WordPiece.
Example (BPE):
Input: "unbelievable"
Output: ["un", "believ", "able"]
Problems:
o Complexity: Requires pre-processing to build sub-word vocabularies.
o Language nuances: Morphologically rich languages may still require
additional handling.
2. Byte Pair Encoding (BPE)
What is BPE?
A sub-word tokenization algorithm based on merging the most frequent pairs of
characters or sub-words iteratively until a vocabulary size is reached.
Frequently used in NLP models like GPT and BERT.
Steps:
1. Start with character-level tokens.
Example: "unbelievable" → ["u", "n", "b", "e", "l", "i", "e", "v", "a", "b", "l", "e"]
2. Identify the most frequent adjacent pair and merge it into a new symbol.
Example: "b" + "e" → "be" → ["u", "n", "be", "l", "i", "e", "v", "a", "b", "l", "e"]
3. Repeat until the target vocabulary size is reached.
Advantages:
o Reduces OOV problems by handling rare words as sub-words.
o Efficient storage and processing compared to word-level tokenization.
o Language-independent.
Challenges:
o Training is computationally expensive.
o May split some meaningful words into less interpretable sub-words.
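A minimal, illustrative sketch of the BPE merge loop on a toy corpus (not a production implementation; the corpus and the number of merges are arbitrary):
from collections import Counter
def get_pair_counts(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs
def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged
# Toy corpus: each word is a tuple of characters mapped to its frequency
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for step in range(5):                                     # 5 merges, chosen arbitrarily
    best = get_pair_counts(corpus).most_common(1)[0][0]   # most frequent adjacent pair
    corpus = merge_pair(corpus, best)
    print(f"merge {step + 1}: {best}")
print(list(corpus))                                       # resulting sub-word segmentations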
Comparison of Tokenization Methods
Aspect | Word Tokenization | Character Tokenization | Sub-Word Tokenization (BPE)
Granularity | Words | Characters | Sub-words
OOV Handling | Poor | Excellent | Good
Sequence Length | Moderate | Long | Short to Moderate
Semantic Retention | Good | Poor | Moderate
Language Dependence | High | Low | Moderate
Computational Cost | Low | High | Moderate
Conclusion
Word Tokenization: Works well for languages with space-separated words but
struggles with OOV and morphological variation.
Character Tokenization: Best for language-agnostic tasks but computationally
expensive and loses semantic structure.
Sub-Word Tokenization (BPE): A practical middle-ground solution widely used in
modern NLP.
String Matching and Spelling Correction: Minimum Edit Distance
Minimum Edit Distance is the minimum number of operations required to convert one
string into another. Common operations include:
1. Insertion: Add a character.
2. Deletion: Remove a character.
3. Substitution: Replace one character with another.
This is widely used in:
Spelling correction
DNA sequence analysis
Natural language processing tasks
Dynamic Programming Approach
Dynamic programming is used to calculate the minimum edit distance efficiently by filling
a table (dp) to compute the result iteratively.
Steps to Calculate Minimum Edit Distance
1. Define the Problem
Let the two strings be:
s1 (source string) of length m
s2 (target string) of length n
We need to compute the cost of converting s1 into s2.
2. Define the Table
Create a 2D table dp[m+1][n+1].
dp[i][j] represents the minimum edit distance to convert the first i characters of s1 into
the first j characters of s2.
3. Initialization
1. If one string is empty, the cost is the length of the other string.
o dp[i][0] = i (Cost of deleting all characters from s1)
o dp[0][j] = j (Cost of inserting all characters into s1)
4. Recurrence Relation
For every character in s1 and s2, compare:
1. If characters are the same:
o No cost: dp[i][j] = dp[i-1][j-1]
2. If characters are different:
o Take the minimum of:
Insertion: dp[i][j-1] + 1
Deletion: dp[i-1][j] + 1
Substitution: dp[i-1][j-1] + 1
Formula:
dp[i][j] = min(
dp[i-1][j] + 1, # Deletion
dp[i][j-1] + 1, # Insertion
dp[i-1][j-1] + cost # Substitution (cost = 0 if s1[i-1] == s2[j-1], else 1)
)
5. Algorithm
def minimum_edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # Initialize DP table
    dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]
    # Fill base cases
    for i in range(m + 1):
        dp[i][0] = i  # Cost of deleting all characters
    for j in range(n + 1):
        dp[0][j] = j  # Cost of inserting all characters
    # Fill the table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:  # Characters match
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + 1,      # Deletion
                    dp[i][j - 1] + 1,      # Insertion
                    dp[i - 1][j - 1] + 1   # Substitution
                )
    # Minimum edit distance is in the bottom-right corner
    return dp[m][n]
# Example Usage
s1 = "kitten"
s2 = "sitting"
print("Minimum Edit Distance:", minimum_edit_distance(s1, s2))
# Output: Minimum Edit Distance: 3
6. Table Filling Example
For s1 = "kitten" and s2 = "sitting":
      ""  s  i  t  t  i  n  g
 ""    0  1  2  3  4  5  6  7
 k     1  1  2  3  4  5  6  7
 i     2  2  1  2  3  4  5  6
 t     3  3  2  1  2  3  4  5
 t     4  4  3  2  1  2  3  4
 e     5  5  4  3  2  2  3  4
 n     6  6  5  4  3  3  2  3
The bottom-right cell (dp[6][7] = 3) is the minimum edit distance.
7. Analysis
1. Time Complexity:
o O(m · n), where m and n are the lengths of the strings.
2. Space Complexity:
o O(m · n) for the table.
3. Applications:
o Spell checkers (suggesting closest corrections; see the sketch below).
o DNA sequence alignment (biological string matching).
o Plagiarism detection.
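As a small illustration of the spell-checking use case, candidate corrections from a vocabulary can be ranked by their edit distance to a misspelled word (a sketch that reuses minimum_edit_distance from above; the vocabulary is made up):
def suggest(word, vocabulary, max_distance=2):
    # Return vocabulary words within max_distance edits, closest first
    scored = sorted((minimum_edit_distance(word, w), w) for w in vocabulary)
    return [w for d, w in scored if d <= max_distance]
vocabulary = ["kitten", "sitting", "mitten", "knitting", "written"]
print(suggest("kittne", vocabulary))
# Expected: ['kitten']  (2 edits away; the other candidates need 3 or more)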