Module II
Meta-Characters in Regex
Summary Table
Example:
import re
pattern = re.compile(r'\d+')  # Compiles a regex for matching one or more digits
1. match() Method:
o Tries to match the pattern only at the beginning of the string.
o Returns a match object if successful, else None.
o Example:
o result = pattern.match("123abc")
o print(result.group()) # Output: "123"
2. search() Method:
o Scans the entire string and returns the first match found.
o Returns a match object if successful, else None.
o Example:
o result = pattern.search("abc123xyz")
o print(result.group()) # Output: "123"
finditer() Method
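Scans the entire string, like search(), but returns an iterator that yields a match object for every non-overlapping match. A short sketch, reusing the \d+ pattern compiled above (the sample string is made up):
result = pattern.finditer("a1 b22 c333")
for match in result:
    print(match.group())
# Output: 1, 22 and 333, printed one per line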
Logical OR (|)
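The | meta-character matches either the expression on its left or the one on its right. A short sketch (the pattern and test strings are made up):
or_pattern = re.compile(r'cat|dog')
print(or_pattern.search("I have a dog").group())   # Output: "dog"
print(or_pattern.search("The cat sleeps").group()) # Output: "cat"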
1. Caret (^):
o Matches the beginning of a string.
Example: ^abc matches "abc" only if it’s at the start.
Parentheses (())
1. Grouping:
o Groups part of the regex for operations like quantifiers or logical OR.
o Example: (abc)+ matches "abc", "abcabc", etc.
2. Capturing:
o Captures the matched substring for later use.
o Example:
o pattern = re.compile(r'(\d+)-(\w+)')
o result = pattern.search("123-abc")
o print(result.group(1)) # Output: "123"
o print(result.group(2)) # Output: "abc"
3. Non-Capturing Groups:
o Use (?:...) to group without capturing.
o Example: (?:abc)+ groups "abc" but does not create a capture group.
Matches "ab",
Parentheses (()) Grouping and capturing (ab)+
"abab", etc.
Here's a brief explanation of string modification methods in regex: split, sub, and subn.
1. split() Method
Purpose: Splits a string into a list using a regex pattern as the delimiter.
Syntax: re.split(pattern, string, maxsplit=0)
o pattern: Regex pattern used for splitting.
o string: Input string to split.
o maxsplit: Maximum number of splits (default is 0, which means no limit).
Example:
import re
result = re.split(r'\s+', 'This is a test string')
print(result)
# Output: ['This', 'is', 'a', 'test', 'string']
Advanced Example:
result = re.split(r'[.,]', 'apple,orange.banana')
print(result)
# Output: ['apple', 'orange', 'banana']
2. sub() Method
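Purpose: Replaces every match of the pattern in the string with a replacement and returns the modified string.
Syntax: re.sub(pattern, repl, string, count=0)
o pattern: Regex pattern to search for.
o repl: Replacement string.
o string: Input string.
o count: Maximum number of replacements (default is 0, which means replace all).
Example (a short sketch; the sample string is made up):
import re
result = re.sub(r'\d+', '#', 'Order 66 shipped 2 items')
print(result)
# Output: Order # shipped # items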
3. subn() Method
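Purpose: Works like sub() but returns a tuple (new_string, number_of_substitutions).
Syntax: re.subn(pattern, repl, string, count=0)
Example (a short sketch, reusing the sample string above):
result = re.subn(r'\d+', '#', 'Order 66 shipped 2 items')
print(result)
# Output: ('Order # shipped # items', 2)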
Key Differences
sub() returns only the modified string, while subn() additionally returns the number of substitutions made, which is useful when you need to know whether anything was replaced.
Here’s a concise explanation of text processing concepts and how they are handled using
spaCy:
1. Words:
o Basic units of text.
o Words are typically separated by spaces or punctuation in a sentence.
2. Tokens:
o A token is a segment of text (e.g., words, punctuation, or symbols).
o Tokenization is the process of breaking text into these segments.
3. Counting Words:
o Determining the number of words or tokens in a text.
o Includes removing duplicates (for unique words) or punctuation depending on
requirements.
4. Vocabulary:
o The set of unique words in a text or corpus.
o Often built after preprocessing (e.g., converting text to lowercase, removing
stopwords).
5. Corpus:
o A collection of texts used for analysis.
o Can be a single document or a large dataset (e.g., Wikipedia articles).
6. Tokenization:
o Splitting text into smaller units (tokens), such as words or subwords.
o spaCy provides efficient and accurate tokenization that respects language
rules.
1. Tokenization in spaCy
Basic Tokenization:
import spacy
nlp = spacy.load("en_core_web_sm") # Load spaCy model
doc = nlp("This is a sample sentence!")
for token in doc:
    print(token.text)
# Output:
# This
# is
# a
# sample
# sentence
# !
Customizing Tokenization:
o spaCy allows customizing tokenization rules using nlp.tokenizer, as sketched below.
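For example, a special case can be added so that a specific string is kept as one token. A minimal sketch using nlp.tokenizer.add_special_case (the example word and sentence are made up for illustration):
from spacy.symbols import ORTH

# Keep "don't" as a single token instead of the default "do" + "n't" split
nlp.tokenizer.add_special_case("don't", [{ORTH: "don't"}])
doc = nlp("I don't know.")
print([token.text for token in doc])
# Expected: ['I', "don't", 'know', '.']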
2. Counting Words
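A minimal sketch of counting token frequencies with collections.Counter (the sample sentence is made up):
from collections import Counter

doc = nlp("This is a test. This test is simple.")
# Count word tokens, ignoring punctuation
word_counts = Counter(token.text.lower() for token in doc if not token.is_punct)
print(word_counts)
# e.g. Counter({'this': 2, 'is': 2, 'test': 2, 'a': 1, 'simple': 1})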
3. Vocabulary Extraction
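Similarly, the vocabulary (the set of unique word tokens) can be extracted; a short sketch, continuing the counting example above:
vocabulary = {token.text.lower() for token in doc if not token.is_punct}
print(vocabulary)
# e.g. {'this', 'is', 'a', 'test', 'simple'} (set order may vary)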
4. Corpus Handling
Apply spaCy to a corpus (multiple documents):
texts = ["This is the first document.", "This is the second."]
docs = [nlp(text) for text in texts]
for doc in docs:
    print([token.text for token in doc])
Workflow Example
1. Preprocessing:
o Lowercasing, punctuation removal, stopword removal.
2. Tokenization:
o Break text into tokens (words, punctuation, etc.).
3. Vocabulary Building:
o Collect unique tokens.
4. Word Counting:
o Count occurrences of words/tokens.
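Putting the four workflow steps together, here is a compact sketch with spaCy; the sample texts are made up for illustration:
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["The food was great!", "The service was not great."]

# Steps 1-2: preprocess and tokenize (lowercase, drop punctuation and stopwords)
tokens = []
for doc in (nlp(text) for text in texts):
    tokens.extend(t.text.lower() for t in doc if not t.is_punct and not t.is_stop)

# Step 3: vocabulary building (unique tokens)
vocabulary = set(tokens)

# Step 4: word counting (occurrences of each token)
counts = Counter(tokens)
print(vocabulary)
print(counts)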
Summary Table
1. Load Dataset
import pandas as pd
# Load dataset
url = "https://example.com/yelp_reviews.csv" # Replace with your dataset URL or path
df = pd.read_csv(url)
# Inspect data
print(df.head())
# Output:
# text sentiment
# 0 "Great food and service!" positive
# 1 "Terrible experience!" negative
2. Data Preparation
1. Cleaning and Tokenization:
o Remove special characters, lowercase the text, and tokenize it.
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# Note: requires nltk.download('punkt') and nltk.download('stopwords') on first use
stop_words = set(stopwords.words('english'))
def preprocess(text):
    # Lowercase and remove special characters
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Tokenize and remove stopwords
    tokens = [word for word in word_tokenize(text) if word not in stop_words]
    return tokens
df['tokens'] = df['text'].apply(preprocess)
print(df.head())
2. Build Vocabulary:
o Use tokens from the entire dataset to construct a vocabulary.
from collections import Counter
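A minimal sketch of how the vocabulary could be built from df['tokens'] using Counter; it produces the vocab_dict word-to-index mapping that one_hot_encode in the next section relies on (the exact construction here is an assumption):
# Count token frequencies across the whole dataset
word_counts = Counter(token for tokens in df['tokens'] for token in tokens)

# Map each word to an integer index (most frequent words first)
vocab_dict = {word: idx for idx, (word, _) in enumerate(word_counts.most_common())}
print(len(vocab_dict), "unique tokens")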
3. Encode Data
1. One-Hot Encoding:
o Convert tokens to one-hot vectors based on the vocabulary.
def one_hot_encode(tokens, vocab_dict):
    # Map each in-vocabulary token to its integer index in the vocabulary
    return [vocab_dict[word] for word in tokens if word in vocab_dict]
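Before the split below, a feature matrix X and label vector y are needed; a possible sketch, assuming multi-hot vectors over the vocabulary as features and the sentiment column as binary labels (these names and choices are illustrative, not the only option):
import numpy as np
from sklearn.model_selection import train_test_split

def to_multi_hot(index_list, vocab_size):
    # Binary vector with a 1 at every vocabulary index present in the review
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[index_list] = 1.0
    return vec

vocab_size = len(vocab_dict)
X = np.stack([to_multi_hot(one_hot_encode(tokens, vocab_dict), vocab_size)
              for tokens in df['tokens']])
y = (df['sentiment'] == 'positive').astype(int).values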
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Language-Independent Tokenization
Tokenization is a crucial step in natural language processing (NLP) to break down text into
smaller units (tokens). The process is tailored to different use cases and text structures, often
depending on language-specific or task-specific requirements.
1. Types of Tokenization
a. Word Tokenization
Splits text into words or word-like units.
Commonly used for tasks like machine translation, sentiment analysis, and text
classification.
Example:
Input: "Tokenization is crucial!"
Output: ["Tokenization", "is", "crucial", "!"]
Problems:
o Language dependence: Rules for word boundaries differ across languages.
Example: In Chinese or Japanese, words are not space-separated.
o Out-of-vocabulary (OOV) words: Fails for unknown or rare words.
Example: "unfathomably" won't match a pre-trained vocabulary.
o Ambiguity: Hyphenated words or contractions may be split incorrectly.
b. Character Tokenization
Splits text into individual characters.
Language-agnostic and works well for morphologically rich or space-free languages.
Example:
Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
Drawbacks:
o Longer sequences: Increases sequence length, making models
computationally expensive.
o Loss of semantic structure: Characters alone don’t capture meaningful
information.
Example: The meaning of "word" is lost when broken into ["w", "o",
"r", "d"].
o Difficulty in learning dependencies: Requires more training data for
meaningful patterns.
c. Sub-Word Tokenization
Breaks words into smaller units (sub-words) based on frequency patterns.
Strikes a balance between word and character tokenization.
Common methods include Byte Pair Encoding (BPE) and WordPiece.
Example (BPE):
Input: "unbelievable"
Output: ["un", "believ", "able"]
Problems:
o Complexity: Requires pre-processing to build sub-word vocabularies.
o Language nuances: Morphologically rich languages may still require
additional handling.
Language dependence: word tokenization is high, character tokenization is low, and sub-word tokenization is moderate.
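To make the BPE idea concrete, here is a toy sketch of a single merge step on a made-up four-word corpus (real BPE repeats this merge loop many times and also tracks end-of-word markers, which are omitted here):
from collections import Counter

corpus = ["low", "lower", "newest", "widest"]

# Start from character-level symbols
words = [list(w) for w in corpus]

# Count adjacent symbol pairs across all words
pair_counts = Counter()
for symbols in words:
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += 1

# The most frequent pair becomes a new merged symbol
best_pair = pair_counts.most_common(1)[0][0]
print("Pair to merge:", best_pair)  # ('l', 'o') for this toy corpus

# Apply the merge to every word
merged = []
for symbols in words:
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    merged.append(out)
print(merged)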
Conclusion
Word Tokenization: Works well for languages with space-separated words but
struggles with OOV and morphological variation.
Character Tokenization: Best for language-agnostic tasks but computationally
expensive and loses semantic structure.
Sub-Word Tokenization (BPE): A practical middle-ground solution widely used in
modern NLP.
String Matching and Spelling Correction: Minimum Edit Distance
Minimum Edit Distance is the minimum number of operations required to convert one
string into another. Common operations include:
1. Insertion: Add a character.
2. Deletion: Remove a character.
3. Substitution: Replace one character with another.
This is widely used in:
Spelling correction
DNA sequence analysis
Natural language processing tasks
3. Initialization
1. If one string is empty, the cost is the length of the other string.
o dp[i][0] = i (Cost of deleting all characters from s1)
o dp[0][j] = j (Cost of inserting all characters into s1)
4. Recurrence Relation
For every character in s1 and s2, compare:
1. If characters are the same:
o No cost: dp[i][j] = dp[i-1][j-1]
2. If characters are different:
o Take the minimum of:
Insertion: dp[i][j-1] + 1
Deletion: dp[i-1][j] + 1
Substitution: dp[i-1][j-1] + 1
Formula:
dp[i][j] = min(
dp[i-1][j] + 1, # Deletion
dp[i][j-1] + 1, # Insertion
dp[i-1][j-1] + cost # Substitution (cost = 0 if s1[i-1] == s2[j-1], else 1)
)
5. Algorithm
def minimum_edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # Initialize DP table; row 0 and column 0 hold the empty-string base cases
    dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(m + 1): dp[i][0] = i
    for j in range(n + 1): dp[0][j] = j
    # Fill the table using the recurrence relation
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # Deletion
                           dp[i][j - 1] + 1,         # Insertion
                           dp[i - 1][j - 1] + cost)  # Substitution
    return dp[m][n]

# Example Usage
s1 = "kitten"
s2 = "sitting"
print("Minimum Edit Distance:", minimum_edit_distance(s1, s2))
# Output: Minimum Edit Distance: 3
6. DP Table ("kitten" → "sitting")
        s   i   t   t   i   n   g
    0   1   2   3   4   5   6   7
k   1   1   2   3   4   5   6   7
i   2   2   1   2   3   4   5   6
t   3   3   2   1   2   3   4   5
t   4   4   3   2   1   2   3   4
e   5   5   4   3   2   2   3   4
n   6   6   5   4   3   3   2   3
The bottom-right cell gives the minimum edit distance: 3.
7. Analysis
1. Time Complexity:
o O(m · n), where m and n are the lengths of the strings.
2. Space Complexity:
o O(m · n) for the DP table.
3. Applications:
o Spell checkers (suggesting closest corrections).
o DNA sequence alignment (biological string matching).
o Plagiarism detection.