Module II
Meta-Characters in Regex
Summary Table
Example:
import re
pattern = re.compile(r'\d+')  # Compiles a regex for matching one or more digits
1. match() Method:
o Tries to match the pattern only at the beginning of the string.
o Returns a match object if successful, else None.
o Example:
o result = pattern.match("123abc")
o print(result.group()) # Output: "123"
2. search() Method:
o Scans the entire string and returns the first match found.
o Returns a match object if successful, else None.
o Example:
o result = pattern.search("abc123xyz")
o print(result.group()) # Output: "123"
finditer() Method
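Scans the entire string, like search(), but returns an iterator that yields a match object for every non-overlapping match. A short sketch, reusing the \d+ pattern compiled above (the sample string is made up):
result = pattern.finditer("a1 b22 c333")
for match in result:
    print(match.group())
# Output: 1, 22 and 333, printed one per line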
Logical OR (|)
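The | meta-character matches either the expression on its left or the one on its right. A short sketch (the pattern and test strings are made up):
or_pattern = re.compile(r'cat|dog')
print(or_pattern.search("I have a dog").group())   # Output: "dog"
print(or_pattern.search("The cat sleeps").group()) # Output: "cat"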
1. Caret (^):
o Matches the beginning of a string.
Example: ^abc matches "abc" only if it’s at the start.
Parentheses (())
1. Grouping:
o Groups part of the regex for operations like quantifiers or logical OR.
o Example: (abc)+ matches "abc", "abcabc", etc.
2. Capturing:
o Captures the matched substring for later use.
o Example:
o pattern = re.compile(r'(\d+)-(\w+)')
o result = pattern.search("123-abc")
o print(result.group(1)) # Output: "123"
o print(result.group(2)) # Output: "abc"
3. Non-Capturing Groups:
o Use (?:...) to group without capturing.
o Example: (?:abc)+ groups "abc" but does not create a capture group.
Matches "ab",
Parentheses (()) Grouping and capturing (ab)+
"abab", etc.
Here's a brief explanation of string modification methods in regex: split, sub, and subn.
1. split() Method
Purpose: Splits a string into a list using a regex pattern as the delimiter.
Syntax: re.split(pattern, string, maxsplit=0)
o pattern: Regex pattern used for splitting.
o string: Input string to split.
o maxsplit: Maximum number of splits (default is 0, which means no limit).
Example:
import re
result = re.split(r'\s+', 'This is a test string')
print(result)
# Output: ['This', 'is', 'a', 'test', 'string']
Advanced Example:
result = re.split(r'[.,]', 'apple,orange.banana')
print(result)
# Output: ['apple', 'orange', 'banana']
2. sub() Method
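Purpose: Replaces every match of the pattern in the string with a replacement and returns the modified string.
Syntax: re.sub(pattern, repl, string, count=0)
o pattern: Regex pattern to search for.
o repl: Replacement string.
o string: Input string.
o count: Maximum number of replacements (default is 0, which means replace all).
Example (a short sketch; the sample string is made up):
import re
result = re.sub(r'\d+', '#', 'Order 66 shipped 2 items')
print(result)
# Output: Order # shipped # items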
3. subn() Method
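Purpose: Works like sub() but returns a tuple (new_string, number_of_substitutions).
Syntax: re.subn(pattern, repl, string, count=0)
Example (a short sketch, reusing the sample string above):
result = re.subn(r'\d+', '#', 'Order 66 shipped 2 items')
print(result)
# Output: ('Order # shipped # items', 2)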
Key Differences
sub() returns only the modified string, while subn() additionally returns the number of substitutions made, which is useful when you need to know whether anything was replaced.
Here’s a concise explanation of text processing concepts and how they are handled using
spaCy:
1. Words:
o Basic units of text.
o Words are typically separated by spaces or punctuation in a sentence.
2. Tokens:
o A token is a segment of text (e.g., words, punctuation, or symbols).
o Tokenization is the process of breaking text into these segments.
3. Counting Words:
o Determining the number of words or tokens in a text.
o Includes removing duplicates (for unique words) or punctuation depending on
requirements.
4. Vocabulary:
o The set of unique words in a text or corpus.
o Often built after preprocessing (e.g., converting text to lowercase, removing
stopwords).
5. Corpus:
o A collection of texts used for analysis.
o Can be a single document or a large dataset (e.g., Wikipedia articles).
6. Tokenization:
o Splitting text into smaller units (tokens), such as words or subwords.
o spaCy provides efficient and accurate tokenization that respects language
rules.
1. Tokenization in spaCy
Basic Tokenization:
import spacy
nlp = spacy.load("en_core_web_sm") # Load spaCy model
doc = nlp("This is a sample sentence!")
for token in doc:
    print(token.text)
# Output:
# This
# is
# a
# sample
# sentence
# !
Customizing Tokenization:
o spaCy allows customizing tokenization rules using nlp.tokenizer, as sketched below.
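For example, a special case can be added so that a specific string is kept as one token. A minimal sketch using nlp.tokenizer.add_special_case (the example word and sentence are made up for illustration):
from spacy.symbols import ORTH

# Keep "don't" as a single token instead of the default "do" + "n't" split
nlp.tokenizer.add_special_case("don't", [{ORTH: "don't"}])
doc = nlp("I don't know.")
print([token.text for token in doc])
# Expected: ['I', "don't", 'know', '.']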
2. Counting Words
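A minimal sketch of counting token frequencies with collections.Counter (the sample sentence is made up):
from collections import Counter

doc = nlp("This is a test. This test is simple.")
# Count word tokens, ignoring punctuation
word_counts = Counter(token.text.lower() for token in doc if not token.is_punct)
print(word_counts)
# e.g. Counter({'this': 2, 'is': 2, 'test': 2, 'a': 1, 'simple': 1})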
3. Vocabulary Extraction
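Similarly, the vocabulary (the set of unique word tokens) can be extracted; a short sketch, continuing the counting example above:
vocabulary = {token.text.lower() for token in doc if not token.is_punct}
print(vocabulary)
# e.g. {'this', 'is', 'a', 'test', 'simple'} (set order may vary)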
4. Corpus Handling
Apply spaCy to a corpus (multiple documents):
texts = ["This is the first document.", "This is the second."]
docs = [nlp(text) for text in texts]
for doc in docs:
    print([token.text for token in doc])
Workflow Example
1. Preprocessing:
o Lowercasing, punctuation removal, stopword removal.
2. Tokenization:
o Break text into tokens (words, punctuation, etc.).
3. Vocabulary Building:
o Collect unique tokens.
4. Word Counting:
o Count occurrences of words/tokens.
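Putting the four workflow steps together, here is a compact sketch with spaCy; the sample texts are made up for illustration:
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["The food was great!", "The service was not great."]

# Steps 1-2: preprocess and tokenize (lowercase, drop punctuation and stopwords)
tokens = []
for doc in (nlp(text) for text in texts):
    tokens.extend(t.text.lower() for t in doc if not t.is_punct and not t.is_stop)

# Step 3: vocabulary building (unique tokens)
vocabulary = set(tokens)

# Step 4: word counting (occurrences of each token)
counts = Counter(tokens)
print(vocabulary)
print(counts)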
Summary Table
1. Load Dataset
import pandas as pd
# Load dataset
url = "https://example.com/yelp_reviews.csv" # Replace with your dataset URL or path
df = pd.read_csv(url)
# Inspect data
print(df.head())
# Output:
# text sentiment
# 0 "Great food and service!" positive
# 1 "Terrible experience!" negative
2. Data Preparation
1. Cleaning and Tokenization:
o Remove special characters, lowercase the text, and tokenize it.
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# Note: requires nltk.download('punkt') and nltk.download('stopwords') on first use
stop_words = set(stopwords.words('english'))
def preprocess(text):
    # Lowercase and remove special characters
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Tokenize and remove stopwords
    tokens = [word for word in word_tokenize(text) if word not in stop_words]
    return tokens
df['tokens'] = df['text'].apply(preprocess)
print(df.head())
2. Build Vocabulary:
o Use tokens from the entire dataset to construct a vocabulary.
from collections import Counter
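A minimal sketch of how the vocabulary could be built from df['tokens'] using Counter; it produces the vocab_dict word-to-index mapping that one_hot_encode in the next section relies on (the exact construction here is an assumption):
# Count token frequencies across the whole dataset
word_counts = Counter(token for tokens in df['tokens'] for token in tokens)

# Map each word to an integer index (most frequent words first)
vocab_dict = {word: idx for idx, (word, _) in enumerate(word_counts.most_common())}
print(len(vocab_dict), "unique tokens")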
3. Encode Data
1. One-Hot Encoding:
o Convert tokens to one-hot vectors based on the vocabulary.
def one_hot_encode(tokens, vocab_dict):
    # Map each in-vocabulary token to its integer index in the vocabulary
    return [vocab_dict[word] for word in tokens if word in vocab_dict]
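Before the split below, a feature matrix X and label vector y are needed; a possible sketch, assuming multi-hot vectors over the vocabulary as features and the sentiment column as binary labels (these names and choices are illustrative, not the only option):
import numpy as np
from sklearn.model_selection import train_test_split

def to_multi_hot(index_list, vocab_size):
    # Binary vector with a 1 at every vocabulary index present in the review
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[index_list] = 1.0
    return vec

vocab_size = len(vocab_dict)
X = np.stack([to_multi_hot(one_hot_encode(tokens, vocab_dict), vocab_size)
              for tokens in df['tokens']])
y = (df['sentiment'] == 'positive').astype(int).values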
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Language-Independent Tokenization
Tokenization is a crucial step in natural language processing (NLP) to break down text into
smaller units (tokens). The process is tailored to different use cases and text structures, often
depending on language-specific or task-specific requirements.
1. Types of Tokenization
a. Word Tokenization
Splits text into words or word-like units.
Commonly used for tasks like machine translation, sentiment analysis, and text
classification.
Example:
Input: "Tokenization is crucial!"
Output: ["Tokenization", "is", "crucial", "!"]
Problems:
o Language dependence: Rules for word boundaries differ across languages.
Example: In Chinese or Japanese, words are not space-separated.
o Out-of-vocabulary (OOV) words: Fails for unknown or rare words.
Example: "unfathomably" won't match a pre-trained vocabulary.
o Ambiguity: Hyphenated words or contractions may be split incorrectly.
b. Character Tokenization
Splits text into individual characters.
Language-agnostic and works well for morphologically rich or space-free languages.
Example:
Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
Drawbacks:
o Longer sequences: Increases sequence length, making models
computationally expensive.
o Loss of semantic structure: Characters alone don’t capture meaningful
information.
Example: The meaning of "word" is lost when broken into ["w", "o",
"r", "d"].
o Difficulty in learning dependencies: Requires more training data for
meaningful patterns.
c. Sub-Word Tokenization
Breaks words into smaller units (sub-words) based on frequency patterns.
Strikes a balance between word and character tokenization.
Common methods include Byte Pair Encoding (BPE) and WordPiece.
Example (BPE):
Input: "unbelievable"
Output: ["un", "believ", "able"]
Problems:
o Complexity: Requires pre-processing to build sub-word vocabularies.
o Language nuances: Morphologically rich languages may still require
additional handling.
Language dependence: word tokenization is high, character tokenization is low, and sub-word tokenization is moderate.
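To make the BPE idea concrete, here is a toy sketch of a single merge step on a made-up four-word corpus (real BPE repeats this merge loop many times and also tracks end-of-word markers, which are omitted here):
from collections import Counter

corpus = ["low", "lower", "newest", "widest"]

# Start from character-level symbols
words = [list(w) for w in corpus]

# Count adjacent symbol pairs across all words
pair_counts = Counter()
for symbols in words:
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += 1

# The most frequent pair becomes a new merged symbol
best_pair = pair_counts.most_common(1)[0][0]
print("Pair to merge:", best_pair)  # ('l', 'o') for this toy corpus

# Apply the merge to every word
merged = []
for symbols in words:
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    merged.append(out)
print(merged)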
Conclusion
Word Tokenization: Works well for languages with space-separated words but
struggles with OOV and morphological variation.
Character Tokenization: Best for language-agnostic tasks but computationally
expensive and loses semantic structure.
Sub-Word Tokenization (BPE): A practical middle-ground solution widely used in
modern NLP.
String Matching and Spelling Correction: Minimum Edit Distance
Minimum Edit Distance is the minimum number of operations required to convert one
string into another. Common operations include:
1. Insertion: Add a character.
2. Deletion: Remove a character.
3. Substitution: Replace one character with another.
This is widely used in:
Spelling correction
DNA sequence analysis
Natural language processing tasks
3. Initialization
1. If one string is empty, the cost is the length of the other string.
o dp[i][0] = i (Cost of deleting all characters from s1)
o dp[0][j] = j (Cost of inserting all characters into s1)
4. Recurrence Relation
For every character in s1 and s2, compare:
1. If characters are the same:
o No cost: dp[i][j] = dp[i-1][j-1]
2. If characters are different:
o Take the minimum of:
Insertion: dp[i][j-1] + 1
Deletion: dp[i-1][j] + 1
Substitution: dp[i-1][j-1] + 1
Formula:
dp[i][j] = min(
dp[i-1][j] + 1, # Deletion
dp[i][j-1] + 1, # Insertion
dp[i-1][j-1] + cost # Substitution (cost = 0 if s1[i-1] == s2[j-1], else 1)
)
5. Algorithm
def minimum_edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # Initialize DP table; row 0 and column 0 hold the empty-string base cases
    dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(m + 1): dp[i][0] = i
    for j in range(n + 1): dp[0][j] = j
    # Fill the table using the recurrence relation
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # Deletion
                           dp[i][j - 1] + 1,         # Insertion
                           dp[i - 1][j - 1] + cost)  # Substitution
    return dp[m][n]

# Example Usage
s1 = "kitten"
s2 = "sitting"
print("Minimum Edit Distance:", minimum_edit_distance(s1, s2))
# Output: Minimum Edit Distance: 3
6. DP Table ("kitten" → "sitting")
        s   i   t   t   i   n   g
    0   1   2   3   4   5   6   7
k   1   1   2   3   4   5   6   7
i   2   2   1   2   3   4   5   6
t   3   3   2   1   2   3   4   5
t   4   4   3   2   1   2   3   4
e   5   5   4   3   2   2   3   4
n   6   6   5   4   3   3   2   3
The bottom-right cell gives the minimum edit distance: 3.
7. Analysis
1. Time Complexity:
o O(m · n), where m and n are the lengths of the strings.
2. Space Complexity:
o O(m · n) for the DP table.
3. Applications:
o Spell checkers (suggesting closest corrections).
o DNA sequence alignment (biological string matching).
o Plagiarism detection.