NLP UNIT-I Part-II

Finding the Structure of Documents

• Document Structuring is a key subtask of Natural Language Generation (NLG).
• It focuses on organizing information into a logical sequence, including deciding sentence order, grouping text into paragraphs, and structuring content flow. It is closely related to Content Determination, which involves selecting the information to be included in the generated text.

Two critical components of document structuring are:
1. Sentence Boundary Detection (SBD)
2. Topic Boundary Detection (TBD)
1. Sentence Boundary Detection (SBD)

• Sentence Boundary Detection (SBD) is the process of identifying where each sentence ends in a given text. It is crucial in NLP applications such as text summarization, machine translation, and speech-to-text processing.

Challenges in Sentence Boundary Detection

• SBD is not as simple as detecting periods (.), because abbreviations, numbers, and formatting variations can cause confusion.

Example of Sentence Boundary Ambiguity

Case 1: Abbreviations

• Incorrect detection:
• Dr. John is an expert in NLP. He has worked at Google Inc. since 2015.
• A naive SBD system might incorrectly split after "Dr." and "Inc.", assuming they are sentence boundaries.

• Correct detection:
• Dr. John is an expert in NLP.
  He has worked at Google Inc. since 2015.

• To handle such cases correctly, rule-based systems (such as regular expressions) or machine learning models are used.
Sentence Boundary Detection

Case 2: Numerical Values and Dates

• Incorrect detection:
• The temperature in New York was 23.5 degrees yesterday. It will be lower today.
• A simple rule-based system might mistakenly treat the period in "23.5" as a sentence break.

• Correct detection:
• The temperature in New York was 23.5 degrees yesterday.
  It will be lower today.

Techniques for Sentence Boundary Detection

• Rule-Based Methods – use regular expressions to identify punctuation patterns (a minimal sketch follows below).
• Statistical Methods – use Hidden Markov Models (HMMs) to learn sentence-ending probabilities.
• Machine Learning Methods – train classifiers such as Naïve Bayes, decision trees, or deep learning models to distinguish sentence boundaries.
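Below is a minimal rule-based sketch in Python. The abbreviation list and the decimal-number guard are illustrative assumptions; a production system would use a much larger lexicon and handle more edge cases (quotes, ellipses, initials).

```python
import re

# Illustrative abbreviation list; a real system would use a larger lexicon.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "co.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    """Split on '.', '!', '?' followed by whitespace and a capital letter,
    skipping known abbreviations and stray decimal-number tokens."""
    sentences, start = [], 0
    # Candidate boundaries: terminator + whitespace + uppercase letter.
    for match in re.finditer(r'[.!?](?=\s+[A-Z])', text):
        end = match.end()
        token = text[:end].split()[-1].lower()  # word containing the terminator
        if token in ABBREVIATIONS:
            continue                            # "Dr." is not a boundary
        if re.fullmatch(r'\d+\.\d*', token):
            continue                            # defensive guard for numbers
        sentences.append(text[start:end].strip())
        start = end
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]

print(split_sentences("Dr. John is an expert in NLP. He has worked at Google Inc. since 2015."))
# -> ['Dr. John is an expert in NLP.', 'He has worked at Google Inc. since 2015.']
```

Note that "Inc. since" never even becomes a candidate here, because the next word is lowercase; the abbreviation list is what saves "Dr. John".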


2. Topic Boundary Detection (TBD)

• Topic Boundary Detection (TBD) identifies where one topic ends and another begins in a document. This is crucial for document summarization, information retrieval, and text segmentation.

Challenges in Topic Boundary Detection

• Detecting topic changes is difficult because topics can shift gradually or abruptly, depending on the writing style.
Example of Topic Boundary Changes

Case 1: News Article

• Consider a news report with the following paragraphs:
• The stock market opened higher today, with major indices gaining points. Experts attribute the rise to positive earnings reports.
• Meanwhile, in sports, the local football team secured a victory against their rivals, thrilling fans.
• A Topic Boundary Detection system should recognize that "Stock Market" and "Sports" are separate topics.

Case 2: Research Paper

• A research paper might have the following sections:
• Introduction – defines the problem and motivation.
  Related Work – discusses previous research.
  Methodology – explains the approach used.
  Results and Discussion – presents findings and insights.
• A TBD system must correctly segment these sections. A simple lexical-cohesion sketch follows.
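One classic unsupervised approach is to measure lexical cohesion between adjacent blocks and place a boundary where similarity dips (the idea behind TextTiling). A minimal sketch, assuming raw paragraphs as input and an illustrative similarity threshold:

```python
from collections import Counter
import math

def cosine(c1, c2):
    """Cosine similarity between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

def topic_boundaries(paragraphs, threshold=0.1):
    """Flag a boundary before each paragraph whose lexical similarity to
    the previous one falls below the threshold (TextTiling-style heuristic).
    Tokenization is a crude whitespace split; a real system would strip
    punctuation and stopwords and tune the threshold on data."""
    vectors = [Counter(p.lower().split()) for p in paragraphs]
    return [i + 1 for i in range(len(vectors) - 1)
            if cosine(vectors[i], vectors[i + 1]) < threshold]

news = [
    "The stock market opened higher today, with major indices gaining points.",
    "Experts attribute the rise to positive earnings reports.",
    "Meanwhile, in sports, the local football team secured a victory against their rivals.",
]
print(topic_boundaries(news))
# e.g. [2] -> a topic shift before the sports paragraph (threshold is data-dependent)
```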


Methods

• Document structuring and sentence segmentation involve techniques to determine sentence boundaries, topic boundaries, and the overall structure of a text. Various machine learning approaches are used to accomplish this task.

The key methods include:
1. Generative Sequence Classification Methods
2. Discriminative Local Classification Methods
3. Hybrid Approaches
4. Discriminative Sequence Classification Methods
5. Extensions for Global Modeling for Sentence Segmentation
1. Generative Sequence Classification Methods

• These methods use probabilistic models that learn the joint probability of words and their corresponding labels (sentence boundaries or topic changes). One of the most common generative models is the Hidden Markov Model (HMM).

Example: Hidden Markov Model (HMM) for Sentence Boundary Detection

• Consider this text:
• "Dr. Smith is an expert in AI. He works at Google Inc. in California."
• A naïve rule-based system might incorrectly split after "Dr." or "Inc.". An HMM-based model instead assigns a probability to whether each word ends a sentence.

How it works
• States: sentence boundary (B), non-boundary (NB).
• Observations: words, punctuation, capitalization.
• Transition probabilities: P(NB → B), P(B → B), etc.

🔹 Correct output (after HMM decoding):
✅ "Dr. Smith is an expert in AI. | He works at Google Inc. in California."

Pros & Cons
• ✅ Simple and interpretable.
• ❌ Cannot capture deep semantic relationships.
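A minimal sketch of this setup: a two-state HMM decoded with the Viterbi algorithm. All probabilities here are illustrative assumptions, not values learned from a corpus; a real system would estimate them from labelled text.

```python
import math

# Toy HMM for boundary tagging: states B (boundary token) / NB (non-boundary).
# All probabilities below are illustrative, not learned from data.
states = ["B", "NB"]
start_p = {"B": 0.1, "NB": 0.9}
trans_p = {"B": {"B": 0.05, "NB": 0.95}, "NB": {"B": 0.3, "NB": 0.7}}

def emit_p(state, token):
    """Crude emission model: tokens ending in '.' are likely boundaries,
    unless they are known abbreviations (list is illustrative)."""
    ends_period = token.endswith(".")
    is_abbrev = token.lower() in {"dr.", "mr.", "inc.", "co."}
    if state == "B":
        return 0.8 if (ends_period and not is_abbrev) else 0.01
    return 0.2 if ends_period else 0.99

def viterbi(tokens):
    """Standard Viterbi decoding in log space: O(T * S^2)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p(s, tokens[0])) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({}); back.append({})
        for s in states:
            best = max(states, key=lambda p: V[t-1][p] + math.log(trans_p[p][s]))
            V[t][s] = V[t-1][best] + math.log(trans_p[best][s]) + math.log(emit_p(s, tokens[t]))
            back[t][s] = best
    path = [max(states, key=lambda s: V[-1][s])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

tokens = "Dr. Smith is an expert in AI. He works at Google Inc. in California.".split()
print(list(zip(tokens, viterbi(tokens))))
# "Dr." and "Inc." should come out NB; "AI." and "California." come out B.
```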


2. Discriminative Local Classification Methods

• Unlike generative models, discriminative models learn decision boundaries to classify each punctuation mark as a sentence boundary (B) or non-boundary (NB).

Example: SVM or Logistic Regression for Sentence Segmentation

• Consider the sentence:
• "New York is beautiful. The weather is great!"
• A local classifier takes each punctuation mark (. or !) and decides whether it marks a sentence boundary.

Features used
• Previous and next words: "beautiful", "The".
• Punctuation type: . or !.
• Capitalization of the next word ("The" is capitalized → likely a new sentence).

🔹 Correct output:
✅ "New York is beautiful. | The weather is great!"


3. Hybrid Approaches

• Hybrid models combine generative and discriminative methods for better accuracy. A common hybrid approach pairs Conditional Random Fields (CRFs) with neural networks.

Example: CRF for Email Segmentation

• Consider an email structure:
  "Dear John,
  I hope you're doing well.
  Best regards,
  Alice"

A CRF-based model considers features like:
• Line breaks (indicating new sections).
• Greetings ("Dear") and signatures ("Best regards").
• Word embeddings to detect sentence importance.

🔹 Correct output (segmented email):
✅ "Dear John, | I hope you're doing well. | Best regards, Alice"

Pros & Cons
• ✅ More accurate than pure rule-based methods.
• ❌ Computationally expensive.
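A minimal sketch of line-level email segmentation with the sklearn-crfsuite library (one possible CRF implementation; pip install sklearn-crfsuite). The features, the GREETING/BODY/SIGNATURE label set, and the one-email training set are all illustrative assumptions.

```python
import sklearn_crfsuite

def line_features(lines, i):
    """Features for one line of an email; the CRF adds label-transition context."""
    line = lines[i]
    feats = {
        "starts_dear": line.lower().startswith("dear"),
        "starts_regards": line.lower().startswith(("best regards", "regards", "sincerely")),
        "is_short": len(line.split()) <= 3,
        "position": i / max(len(lines) - 1, 1),  # 0.0 = first line, 1.0 = last
    }
    if i > 0:
        feats["prev_starts_regards"] = lines[i - 1].lower().startswith(("best regards", "regards"))
    return feats

def featurize(lines):
    return [line_features(lines, i) for i in range(len(lines))]

# Toy training email, labelled line by line (illustrative data only).
email = ["Dear John,", "I hope you're doing well.", "Best regards,", "Alice"]
labels = ["GREETING", "BODY", "SIGNATURE", "SIGNATURE"]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([featurize(email)], [labels])

test = ["Dear Priya,", "The report is attached.", "Regards,", "Sam"]
print(crf.predict([featurize(test)])[0])
# Expected something like ['GREETING', 'BODY', 'SIGNATURE', 'SIGNATURE']
```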
4. Discriminative Sequence Classification Methods

• These methods classify entire sequences rather than individual words, allowing the model to learn context better.

Example: LSTM for Sentence Segmentation in Chat Messages

• Consider a text message conversation:
• "hey how are you? i am fine thanks. what about you?"
• A simple rule-based approach might fail to split this correctly. An LSTM-based model learns sentence structure from word embeddings and contextual dependencies.

🔹 Correct segmentation:
✅ "Hey, how are you? | I am fine, thanks. | What about you?"

Why is an LSTM better?
• It remembers previous words, helping in cases like:
  ✅ "I saw Mr. Brown today. He looked happy."
  (avoids breaking after "Mr.").

Pros & Cons
• ✅ Handles long-range dependencies well.
• ❌ Needs large training datasets.
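A minimal PyTorch sketch of the architecture only: a bidirectional LSTM tagging each token as boundary or non-boundary. The dimensions are illustrative and the model is untrained; a real system needs a tokenizer, labelled data, and a training loop.

```python
import torch
import torch.nn as nn

class LSTMBoundaryTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM: each token sees left and right context,
        # which is what keeps "Mr." attached to "Brown".
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, 2)  # logits for {boundary, non-boundary}

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h)                       # (batch, seq_len, 2)

model = LSTMBoundaryTagger(vocab_size=10_000)
batch = torch.randint(0, 10_000, (1, 12))        # one message of 12 token ids
print(model(batch).shape)                        # torch.Size([1, 12, 2])
# Training would minimize cross-entropy against per-token boundary labels.
```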


5. Extensions for Global Modeling for Sentence Segmentation

• These methods consider long documents and optimize for paragraph and document structuring.

Example: Hierarchical Attention Network (HAN) for News Article Structuring

• Consider a news article:
  "Stock markets rose today due to positive earnings reports. Experts predict further growth.
  Meanwhile, in sports, the local football team won their championship game."

• A Hierarchical Attention Network (HAN):
• First analyzes words within sentences.
• Then analyzes sentences to determine topic boundaries.

🔹 Correct segmentation:
✅ "Stock markets rose today due to positive earnings reports. Experts predict further growth."
✅ "Meanwhile, in sports, the local football team won their championship game."

Pros & Cons
• ✅ Best for long documents and paragraph segmentation.
• ❌ Requires high computational power.
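A compressed PyTorch sketch of the two-level idea: word-level attention builds one vector per sentence, and a sentence-level layer labels each sentence. The dimensions and the per-sentence boundary head are illustrative assumptions, not the published HAN architecture (which pools to a single document vector for classification).

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Soft attention: score each timestep, return the weighted sum."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                            # h: (batch, steps, dim)
        weights = torch.softmax(self.score(h), dim=1)
        return (weights * h).sum(dim=1)              # (batch, dim)

class HANSegmenter(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.word_gru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.word_attn = AttentionPool(2 * hidden)
        self.sent_gru = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.boundary = nn.Linear(2 * hidden, 2)     # per-sentence: topic boundary?

    def forward(self, docs):                         # docs: (batch, n_sents, n_words) ids
        b, s, w = docs.shape
        h_words, _ = self.word_gru(self.embed(docs.view(b * s, w)))
        sent_vecs = self.word_attn(h_words).view(b, s, -1)  # one vector per sentence
        h_sents, _ = self.sent_gru(sent_vecs)
        return self.boundary(h_sents)                # (batch, n_sents, 2)

model = HANSegmenter(vocab_size=5_000)
doc = torch.randint(0, 5_000, (1, 4, 10))            # 1 doc, 4 sentences, 10 words each
print(model(doc).shape)                              # torch.Size([1, 4, 2])
```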


Complexity of the Approaches

• The complexity of different approaches varies in training time, prediction time, memory, and feature extraction. Here's a summary:

1. Discriminative vs. Generative Models

• Discriminative Approaches (e.g., CRFs, SVMs, Neural Networks)
– 🔴 Higher training complexity (require multiple passes over the data).
– 🔴 Slower inference (feature extraction is costly).
– 🟢 Perform well with fewer training samples.
– 🟢 Handle diverse feature sets (e.g., words, POS tags, punctuation).

• Generative Approaches (e.g., HMMs, Naïve Bayes, HELMs)
– 🟢 Handle large datasets efficiently (e.g., decades of news transcripts).
– 🟢 Faster prediction (fewer features, simpler models).
– 🔴 Poor at handling unseen events (limited feature set).

📌 Example:
• An HMM (generative) predicts sentence boundaries using word probabilities alone.
• A CRF (discriminative) uses word features + POS tags + punctuation, but is slower due to feature extraction.
Local vs. Sequence-Based Approaches

• Local Approaches (rule-based, SVMs, decision trees)
– 🟢 Faster (each decision looks at a single candidate in isolation).
– 🔴 Less accurate (miss dependencies between sentences).

• Sequence-Based Approaches (HMMs, CRFs, LSTMs)
– 🔴 More complex due to decoding (evaluate multiple candidate label sequences).
– 🟢 More accurate (capture dependencies across sentences).

📌 Example:
• Local approach: classifies each candidate boundary independently (faster, but ignores context).
• Sequence approach: uses the surrounding sentences for better accuracy (slower).
Polynomial vs. Exponential Complexity

• Dynamic programming lets sequence-based models run in polynomial time instead of exponential time: with T boundary candidates and S states, naively enumerating label sequences costs S^T evaluations, while Viterbi decoding needs only O(T · S²) operations.

• Without dynamic programming, complexity grows exponentially with:
– the number of boundary candidates;
– the number of sentence-boundary states.

Example:
• CRF training: requires multiple inference passes over the training data (expensive).
• HMM training: uses simple probability calculations (faster).
Performance of Approaches

• Performance evaluation depends on accuracy, error rate, precision, recall, and F1-score.

Evaluation Metrics

• Error Rate = (Number of errors) ÷ (Total sentences).
• F1-score = 2 × (Precision × Recall) ÷ (Precision + Recall).
• NIST Error Rate = (Wrong labels) ÷ (Actual boundaries).

Example:

• A rule-based system for sentence segmentation in speech may have a higher error rate due to speech ambiguities.
• A deep learning system (e.g., an LSTM) may have a lower F1-score if trained on limited data.
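A minimal sketch of these metrics computed over predicted vs. actual boundary positions. The NIST-style error rate follows the formula above; the example positions are made up.

```python
def boundary_metrics(predicted, actual):
    """Precision, recall, F1, and NIST-style error rate for boundary sets."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)                  # correctly placed boundaries
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # NIST-style: wrongly labelled candidates (FP + FN) per actual boundary.
    nist_error = len(predicted ^ actual) / len(actual) if actual else 0.0
    return precision, recall, f1, nist_error

# System predicted boundaries after tokens 5, 12, 20; the truth has 5, 12, 18.
print(boundary_metrics({5, 12, 20}, {5, 12, 18}))
# -> (0.666..., 0.666..., 0.666..., 0.666...)
```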
Performance Comparison in Text Segmentation

• Mikheev's rule-based model: error rate = 1.41%.
• With an abbreviation list: error rate = 0.45%.
• With a POS-based classifier: error rate = 0.31%.
• Gillick's SVM-based model: error rate = 0.25% (best performance).

Key Takeaway:
• Supervised ML (SVMs, CRFs) outperforms rule-based methods.
• Sentence segmentation errors propagate to subsequent NLP tasks (e.g., summarization).
Summary Table

Approach                    | Training Complexity | Prediction Speed | Accuracy  | Best Use Cases
Rule-Based                  | Low                 | Fast             | Moderate  | Simple structures, legal documents
HMM (Generative)            | Medium              | Fast             | Moderate  | Speech segmentation
SVMs (Discriminative)       | High                | Slow             | High      | Text classification, sentence segmentation
CRFs (Sequence-based)       | Very High           | Slowest          | Very High | Complex NLP tasks (NER, POS tagging)
Deep Learning (LSTMs, BERT) | Highest             | Slowest          | Best      | Large-scale NLP (summarization, translation)
