NLP UNIT-I Part-II
• Sentence Boundary Detection (SBD) is crucial in NLP applications such as text summarization, machine translation, and speech-to-text processing.
• SBD is not as simple as detecting periods (.) because abbreviations, numbers, and formatting variations can
cause confusion.
Case 1: Abbreviations
• Incorrect detection:
• Dr. John is an expert in NLP. He has worked at Google Inc. since 2015.
• A naive SBD system might incorrectly split after "Dr." and "Inc.", assuming they are sentence boundaries.
• Correct detection:
• To correctly handle such cases, machine learning models or rule-based systems (such as regular expressions) are
used.
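A minimal rule-based sketch of such a system is shown below. It is illustrative only (a small hand-picked abbreviation list, not a complete SBD system): a period counts as a boundary only if it is not part of a known abbreviation or a decimal number, and the next word (if any) starts with a capital letter.

```python
import re

# Illustrative rule-based sentence splitter (not a complete SBD system).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        token = text[:end].rsplit(None, 1)[-1].lower()  # word containing this mark
        if token in ABBREVIATIONS:
            continue  # "Dr." / "Inc." do not end a sentence
        prev_char = text[match.start() - 1] if match.start() > 0 else ""
        next_char = text[end] if end < len(text) else ""
        if prev_char.isdigit() and next_char.isdigit():
            continue  # decimal point inside a number like 23.5
        rest = text[end:].lstrip()
        if rest and not rest[0].isupper():
            continue  # next word is lowercase: probably not a new sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. John is an expert in NLP. He has worked at Google Inc. since 2015."))
```

On the example above this yields two sentences, without splitting at "Dr." or "Inc.".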
Sentence Boundary Detection
Case 2: Numerical Values and Dates
• Incorrect detection:
• The temperature in New York was 23.5 degrees yesterday. It will be lower today.
• A naive system might split after "23." inside "23.5", treating the decimal point as a sentence boundary.
• Correct detection:
• The system should treat "23.5" as a single numeric token and split only after "yesterday.".
• Statistical Methods – Use Hidden Markov Models (HMMs) to learn sentence-ending probabilities.
• Machine Learning Methods – Train classifiers such as Naïve Bayes, Decision Trees, or deep learning models on features around each candidate boundary.
Topic Boundary Detection
• The stock market opened higher today, with major indices gaining points. Experts attribute the rise to positive earnings reports.
• Meanwhile, in sports, the local football team secured a victory against their rivals, thrilling fans.
• A Topic Boundary Detection system should recognize that "Stock Market" and "Sports" are separate topics.
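One common way to detect such topic shifts is lexical cohesion, in the spirit of TextTiling: adjacent sentences that share few content words likely belong to different topics. The sketch below is a toy version; the stopword list and similarity threshold are hand-picked for this example.

```python
from collections import Counter
from math import sqrt

# Toy lexical-cohesion topic boundary detector (TextTiling-style sketch).
STOPWORDS = {"the", "a", "to", "as", "on", "of", "in", "and"}

def content_words(sentence):
    return [w for w in sentence.lower().split() if w not in STOPWORDS]

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def topic_boundaries(sentences, threshold=0.1):
    boundaries = []
    for i in range(len(sentences) - 1):
        if cosine(content_words(sentences[i]), content_words(sentences[i + 1])) < threshold:
            boundaries.append(i + 1)  # a new topic starts at sentence i+1
    return boundaries

sents = [
    "Stock markets rose today on strong earnings reports",
    "Analysts expect markets to keep rising as earnings improve",
    "The local football team won the championship game",
    "Fans of the team celebrated the championship downtown",
]
print(topic_boundaries(sents))
```

Here the only low-similarity gap is between the finance and football sentences, so a single boundary is reported before sentence index 2.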
Generative Sequence Classification Methods
• Generative models learn the joint probability of the token sequence and its labels (sentence boundaries or topic changes). One of the most common generative models is the Hidden Markov Model (HMM).
• A naïve rule-based system might incorrectly split after "Dr." or "Inc.". An HMM-based model instead assigns probabilities to candidate boundaries from the surrounding context, making such false splits unlikely.
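As a sketch of how this works, the snippet below decodes boundary labels ("B" = token ends a sentence, "I" = sentence-internal) with the Viterbi algorithm. All probabilities are hand-set for illustration, not trained from data.

```python
import math

# Minimal HMM boundary tagger decoded with Viterbi (hand-set, untrained parameters).
STATES = ["I", "B"]
start_p = {"I": 0.9, "B": 0.1}
trans_p = {"I": {"I": 0.8, "B": 0.2}, "B": {"I": 0.95, "B": 0.05}}

def emit_p(state, token):
    # Tokens ending in "." are usually boundaries, unless they are abbreviations.
    if token.endswith("."):
        if token.lower() in {"dr.", "mr.", "inc."}:
            return 0.7 if state == "I" else 0.3
        return 0.1 if state == "I" else 0.9
    return 0.99 if state == "I" else 0.01

def viterbi(tokens):
    V = [{s: math.log(start_p[s]) + math.log(emit_p(s, tokens[0])) for s in STATES}]
    back = []
    for t in range(1, len(tokens)):
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            row[s] = V[-1][best] + math.log(trans_p[best][s]) + math.log(emit_p(s, tokens[t]))
            ptr[s] = best
        V.append(row)
        back.append(ptr)
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

tokens = "Dr. John works at Google Inc. in NLP.".split()
print(list(zip(tokens, viterbi(tokens))))
```

With these parameters, "Dr." and "Inc." are tagged sentence-internal while the final "NLP." is tagged as a boundary.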
Discriminative Local Classification Methods
• How it works:
• A local classifier takes each candidate punctuation mark (".", "?", or "!") and decides whether it marks a sentence boundary.
• Features used:
• Punctuation type (e.g., "." or "!").
• 📌 Example: segmenting a letter, where commas and line breaks must not be mistaken for sentence boundaries.
• 🔹 Correct output:
• "Dear John,
• Best regards,
Alice"
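A local classifier of this kind typically turns each candidate punctuation mark into a feature dictionary that a classifier (e.g., logistic regression or an SVM) could score. The feature names below are illustrative, not tied to any specific library.

```python
# Sketch of feature extraction for a discriminative local classifier.
def boundary_features(tokens, i):
    """Features for deciding whether tokens[i] ends a sentence."""
    word = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "punct": word[-1] if word and word[-1] in ".!?" else "NONE",
        "word_lower": word.lower(),
        "next_capitalized": nxt[:1].isupper(),
        "next_is_digit": nxt[:1].isdigit(),
        "is_abbreviation": word.lower() in {"dr.", "mr.", "mrs.", "inc.", "etc."},
    }

tokens = "Dr. John works at Google Inc. since 2015.".split()
print(boundary_features(tokens, 0))
print(boundary_features(tokens, 5))
```

For "Dr." the classifier sees an abbreviation followed by a capitalized word, a combination a trained model learns is usually not a boundary.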
• ❌ Drawback: computationally expensive.
4. Discriminative Sequence Classification Methods
• These methods classify entire sequences rather than individual tokens, allowing the model to learn context on both sides of each candidate boundary. Examples include Conditional Random Fields (CRFs) and recurrent networks such as LSTMs.
• A simple rule-based approach might fail to split correctly; an LSTM-based model learns sentence boundaries from the surrounding context.
• 🔹 Correct segmentation:
• "Stock markets rose today due to positive earnings reports. Experts predict further growth.
• Meanwhile, in sports, the local football team won their championship game."
• The complexity of different approaches varies in time, memory, training cost, prediction cost, and accuracy.
• 📌 Example:
• CRF (Discriminative Model) uses word features + POS tags + punctuation but is slower due to feature
extraction.
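The feature combination described above can be sketched as a per-token dictionary. The POS tags here are supplied by hand for illustration (in practice they would come from a tagger); CRF toolkits such as sklearn-crfsuite accept one such feature dict per token.

```python
# Sketch of CRF feature templates: word identity + POS tag + punctuation.
def crf_features(words, pos_tags, i):
    return {
        "word": words[i].lower(),
        "pos": pos_tags[i],
        "ends_with_punct": words[i][-1] in ".!?",
        "prev_pos": pos_tags[i - 1] if i > 0 else "BOS",
        "next_pos": pos_tags[i + 1] if i + 1 < len(pos_tags) else "EOS",
    }

words = ["Dr.", "John", "works", "here", "."]
pos = ["NNP", "NNP", "VBZ", "RB", "."]
print([crf_features(words, pos, i) for i in range(len(words))])
```

Building and scoring these templates for every token is exactly the feature-extraction cost that makes CRFs slower than simpler local rules.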
Local vs. Sequence-Based Approaches
• Local Approaches (Rule-based, SVMs, Decision Trees)
– 🟢 Faster (each candidate boundary is decided independently).
• Sequence Approaches (HMMs, CRFs)
– 🔴 Slower, but more accurate (model context across the whole sequence).
• 📌 Example:
• Local Approach: Uses only the punctuation mark and its immediate neighbors (limited context).
• Sequence Approach: Uses previous and next sentences for better accuracy (slower).
Polynomial vs. Exponential Complexity
• Scoring every possible labeling of a sequence is exponential in its length; dynamic programming (e.g., the Viterbi algorithm) lets sequence-based models run in polynomial time instead.
• Example:
• Brute force over n tokens with k labels: O(k^n) labelings (exponential).
• Viterbi decoding: O(n·k²) operations (faster).
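The gap between the two is easy to see numerically. The toy counts below compare how many candidate labelings a brute-force decoder must score against how many update steps Viterbi performs, for k = 2 labels (boundary / no boundary) per token.

```python
# Toy operation counts: exhaustive decoding vs. Viterbi dynamic programming.
def brute_force_labelings(n, k=2):
    return k ** n          # every possible label sequence must be scored

def viterbi_steps(n, k=2):
    return n * k * k       # one (prev-state x state) update per token

for n in (10, 20, 30):
    print(n, brute_force_labelings(n), viterbi_steps(n))
```

Already at 20 tokens, brute force needs over a million labelings while Viterbi performs only 80 updates.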
Performance of Approaches
Evaluation Metrics
• Common metrics include precision, recall, F1-score, and error rate.
• Example:
• A rule-based system for sentence segmentation in speech may have a higher error rate, since spoken transcripts often lack reliable punctuation.
• A deep learning system (e.g., LSTMs) may have a lower F1-score if trained on limited data.
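These metrics can be computed directly over boundary positions (here, sets of token indices; the numbers are made up for illustration).

```python
# Precision, recall, and F1 for predicted vs. gold sentence boundaries.
def prf(gold, predicted):
    tp = len(gold & predicted)  # correctly predicted boundaries
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {5, 12, 20, 27}
predicted = {5, 12, 19, 27, 30}
print(prf(gold, predicted))
```

Three of five predictions are correct (precision 0.6) and three of four gold boundaries are found (recall 0.75), giving F1 ≈ 0.67.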
Performance Comparison in Text Segmentation
• Mikheev’s Rule-Based Model: Error rate = 1.41%.
Key Takeaway:
• Supervised ML (SVMs, CRFs) outperforms rule-based methods.
Approach                      Complexity   Speed     Accuracy    Best Suited For
Rule-Based                    Low          Fast      Moderate    Simple structures, legal documents
CRFs (sequence-based)         Very High    Slowest   Very High   Complex NLP tasks (NER, POS tagging)
Deep Learning (LSTMs, BERT)   Highest      Slowest   Best        Large-scale NLP (summarization, translation)